ALLOCATING NETWORK BANDWIDTH

CLAIM TO PRIORITY 
This application claims priority from U.S. Provisional 
Application No. 60/154,152, entitled "Frame-Based Fair 
Bandwidth Allocation For Input/Output Buffered Switches", 
filed on September 15, 1999. 

TECHNICAL FIELD 
This invention relates to allocating network bandwidth 
among data flows in a network device. 

BACKGROUND 

Devices in a network use large switching fabrics capable 
of supporting traffic with a variety of quality of service 
(QoS) requirements. A high performance, multiple QoS (MQoS) 
device achieves high throughput and fair resource allocation 
while providing control over cell loss and delay for 
individual flows or groups of flows. 

Although many scheduling mechanisms suited for an output 
buffered switch architecture have been shown to provide MQoS 
support, the cost and complexity of the output buffered 
fabric is prohibitive for large switch sizes. The crossbar
switch fabric used in input buffered switches scales to
terabit per second (Tbps) speeds; however, the scheduling 
control required to match input and output links over a small 
time interval (internal cell slot) is complex. While recent 
reductions in matching process complexity have increased 
throughput, QoS is still an issue. 

A hybrid architecture provides a compromise between a 
costly output buffered switch fabric and the scheduling 
complexity and QoS management difficulty associated with the 
input buffered switch. A hybrid architecture typically
contains a small amount of memory in the switch fabric/outputs
with additional memory at the inputs. The memory is used to 
buffer data flows passing through the switch. 



SUMMARY

A scheduling process is described for an input/output 
buffered network device, such as a switch, that provides 
relatively high utilization, rate guarantees, and flow-level 
fairness. Bandwidth across the switch fabric is allocated to
groups of flows over a small fixed interval, or frame. The
scheduling processes described herein provide QoS support for
switch fabrics containing relatively small buffers and simple
First-In First-Out (FIFO) or priority based scheduling.

In general, in one aspect, the invention is directed to
allocating bandwidth to data traffic flows for transfer
through a network device. This aspect features allocating
bandwidth to a committed data traffic flow based on a
guaranteed data transfer rate and a queue size of the
committed data traffic flow in the network device, and
allocating bandwidth to uncommitted data traffic flows using
a weighted maximum/minimum process.

This aspect may include one or more of the following. 
The weighted maximum/minimum process allocates bandwidth to 
the uncommitted data traffic flows in proportion to weights 
associated with the uncommitted data traffic flows. The
weighted maximum/minimum process increases bandwidth to the
uncommitted data traffic flows in accordance with the weights
associated with the uncommitted data traffic flows until at
least one of the uncommitted data traffic flows reaches a
maximum bandwidth allocation. The weighted maximum/minimum
process allocates remaining bandwidth to remaining
uncommitted data traffic flows based on weights associated
with the remaining uncommitted data traffic flows. The
bandwidth is data cell slots. The bandwidth is allocated to
the data traffic flows in discrete time intervals.

In general, in another aspect, the invention is directed 
to allocating bandwidth to data flows passing through a 
network device. Each of the data flows has an associated
weight. This aspect of the invention features increasing an
amount of bandwidth to the data flows in proportion to the
weights of the data flows until one port through the network
device reaches a maximum value, freezing the amounts of
bandwidth allocated to the data flows in the one port, and
increasing the amount of bandwidth to remaining data flows 
passing through the network device in proportion to the 
weights of the remaining data flows. 

This aspect of the invention may also include increasing
the amount of bandwidth to the remaining data flows until
another port through the network device reaches a maximum
value, freezing the amounts of bandwidth allocated to the
data flows in the other port, and increasing the amount of
bandwidth to remaining data flows passing through the network
device in proportion to the weights of the remaining data
flows. One or more of the data flows is assigned a minimum
bandwidth and the amount of bandwidth allocated to the one or
more data flows is increased relative to the minimum
bandwidth. The bandwidth may be allocated to the data flows
in discrete time intervals.

In general, in another aspect, the invention is directed 
to a method of allocating bandwidth to data flows passing 
through a network device. This aspect of the invention 
features allocating a predetermined amount of bandwidth to 
one or more of the data flows, and distributing remaining 
bandwidth to remaining data flows. 

This aspect of the invention may include one or more of 
the following features. The remaining bandwidth is 
distributed to the remaining data flows using a weighted 
maximum/minimum process. The weighted maximum/minimum 
process includes increasing an amount of bandwidth to the 
remaining data flows in proportion to weights associated with 
the remaining data flows until one port through the network 
device reaches a maximum value. The weighted maximum/minimum 
process may also include freezing the amounts of bandwidth 
allocated to the remaining data flows in the one port, and 
increasing the amount of bandwidth to still remaining data 
flows passing through the network device in proportion to 
weights of the still remaining data flows. 

In general, in another aspect, the invention is directed 
to allocating bandwidth to data flows passing through a
network device. This aspect features determining a character
of the data flows, and allocating bandwidth to the data flows
in accordance with the character of the data flows. The
bandwidth is allocated to data flows according to which data
flows have a highest probability of using the bandwidth. The
character of the data flows may include peak cell rate, 
likelihood of bursts, and/or average cell rate. 

In general, in another aspect, the invention is directed 
to allocating bandwidth to data flows passing through a
network device. This aspect features allocating the
bandwidth using a weighted maximum/minimum process. 

This aspect may include one or more of the following. 
The weighted maximum/minimum process includes assigning 
weights to the data flows, and allocating the bandwidth to
the data flows according to the weights. Allocating the
bandwidth according to the weights includes increasing an
amount of bandwidth allocated to each data flow in proportion
to a weight assigned to the data flow, and freezing the
amount of bandwidth allocated to a data flow when either (i)
an input port or an output port of the network device reaches
a maximum utilization, or (ii) the data flow reaches a
maximum bandwidth.

This aspect may also include increasing an amount of
bandwidth to remaining data flows passing through the network 
device until either (i) another input port or output port of 
the network device reaches a maximum utilization, or (ii) one 
of the remaining data flows reaches a maximum bandwidth, 
freezing an amount of bandwidth allocated to the remaining 
data flow that has reached a maximum bandwidth or to the 
remaining data flow passing through an input or output port 
that has reached a maximum utilization, and increasing the 
amount of bandwidth to still remaining data flows passing 
through the network device in proportion to weights 
associated with the remaining data flows. 

This aspect of the invention may also include allocating 
a predetermined amount of bandwidth to one or more of the 
data flows, and distributing remaining bandwidth to non- 
frozen remaining data flows by increasing an amount of 
bandwidth allocated to each remaining data flow in proportion 
to a weight assigned to the remaining data flow, and freezing 
the amount of bandwidth allocated to a remaining data flow 
when either (i) an input port or an output port of the 
network device reaches a maximum utilization, or (ii) the 
remaining data flow reaches a maximum bandwidth.

After all of the data flows passing through the network
device are frozen, the remaining bandwidth may be distributed
at an output port to data flows passing through the output
port. The remaining bandwidth may be distributed in
proportion to weights of the data flows passing through the
output port and/or according to which data flows have a 
highest probability of using the bandwidth. The bandwidth is 
allocated/distributed in discrete time intervals. 

In general, in another aspect, the invention is directed
to allocating bandwidth to data flows through a network
device. This aspect features allocating bandwidth to the
data flows using a weighted max/min process. The amount of
bandwidth allocated to data flows passing through an input
port of the network device is greater than an amount of data
that can pass through the input port of the network device.

In general, in another aspect, the invention is directed 
to allocating bandwidth to data flows passing through a 
network device. This aspect features allocating bandwidth to 
data flows passing through input ports of the network device
using a weighted max/min process. Allocating the bandwidth
includes increasing bandwidth allocated to data flows passing
through each input port in proportion to a weight assigned to
the data flow passing through the input port, and freezing an
amount of bandwidth allocated to a data flow passing through
an input port when either (i) the input port reaches a
maximum utilization, or (ii) the data flow reaches a maximum
bandwidth. This aspect may further include continuing to
increase the bandwidth allocated to non-frozen data flows in
proportion to weights of the data flows until an amount of 
bandwidth is frozen at all of the data flows. 

In general, in another aspect, the invention is directed
to allocating bandwidth to data flows through a network
device. This aspect features allocating bandwidth to the
data flows passing through output ports of the network device
using a weighted max/min process.

This aspect may include one or more of the following
features. Allocating the bandwidth includes increasing an
amount of bandwidth allocated to data flows passing through
each output port in proportion to a weight assigned to a data
flow passing through an output port, and freezing the amount
of bandwidth allocated to the data flow passing through the
output port when either (i) the output port reaches a maximum
utilization, or (ii) the data flow reaches a maximum
bandwidth. This aspect may also include continuing to
increase the amount of bandwidth allocated to non-frozen data
flows in proportion to weights of the data flows until the
amount of bandwidth allocated to all data flows is frozen.

In this aspect, the amount of bandwidth is allocated to 
the non-frozen data flows until the bandwidth reaches a 
predetermined maximum amount. After the amount of bandwidth 
assigned to all output ports is frozen, the remaining 
bandwidth is distributed at an output port to data flows 
passing through the output port. The bandwidth may be 
distributed in proportion to weights of the data flows and/or 
according to which data flows have a highest probability of 
using the bandwidth. The bandwidth is allocated/distributed 
in discrete time intervals. 

This aspect may also include allocating bandwidth to 
committed data traffic based on a guaranteed data transfer 
rate. Bandwidth is allocated to the committed data traffic 
in response to a request for bandwidth such that any request 
that is greater than, less than, or equal to the guaranteed 
data transfer rate is granted. The bandwidth is allocated to 
uncommitted data traffic and, for committed data traffic, 
bandwidth is allocated based on a guaranteed transfer rate. 
Remaining bandwidth, not allocated to the committed data 
traffic, is allocated to the uncommitted data traffic.

In general, in another aspect, the invention is directed
to transferring data traffic flows through a network device.
This aspect features transferring a committed data traffic
flow through the network device using a guaranteed bandwidth,
determining an amount of bandwidth that was used during a
previous data traffic flow transfer, and allocating bandwidth
in the network device to uncommitted data traffic flows based
on the amount of bandwidth that was used during the previous
data traffic flow transfer. Allocating the bandwidth may
include determining a difference between the amount of
bandwidth that was used during the previous data traffic flow
transfer and an amount of available bandwidth and allocating
the difference in bandwidth to the uncommitted data traffic
flows.

Other features and advantages of the invention will
become apparent from the following description and drawings.

DESCRIPTION OF THE DRAWINGS 
Fig. 1 is a block diagram of a switch.

Fig. 2 is a plot showing bandwidth allocation.

Fig. 3 is a table showing request/grant phases in a 
bandwidth allocation process. 

Fig. 4 is a timeline showing cell delay.

Fig. 5 is a graph showing delays for data traffic types.

Fig. 6 is a graph showing delays for data traffic types 
that have received bandwidth allocated according to the 
request/grant process described herein. 

Fig. 7 includes two bar charts that show a per-flow 
throughput comparison for uncommitted data traffic. 

Figs. 8 to 20 are block diagrams showing switch
input/output node/port configurations and data flows passing
through the input/output nodes/ports.

Fig. 21 is a block diagram for a process C+ Central Rate
Processing Unit (CRPU) (process C+ described below).

Figs. 22 and 23 are graphs showing bandwidth/rate (R(x))
allocations.

Fig. 24 is a block diagram showing one implementation of
the Distribute Subroutine Module (DSM).

Fig. 25 is a block diagram and cut-away view of a device 
for performing the bandwidth allocation processes described 
herein. 

DETAILED DESCRIPTION 
In a network, data may enter at certain points and exit 
at certain points. Assume that a path through the network is 
set up for each entry/exit pair. Each such path is called a
flow. In a packet switched network it may be desirable to
prevent some subset of the network links (also called ports 
or nodes) from being overwhelmed with data. The sums of the 
rates of data flows passing through these links should be 
limited to less than or equal to the capacity of the links. 

Some switches can be modeled in the foregoing manner. 
They include hybrid architecture switches having a number of 
input ports with significant buffering and a number of output 
ports with very limited buffering. If data is flowing into 
an output port faster than data is leaving the output port, 
data will build up in the output port buffer until the buffer 
overflows and data is lost. The processes described herein 
determine allowed rates for each data flow in the network to 
reduce data loss and maintain a predetermined QoS for a 
network. 
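
The link constraint described above (the summed rates of the flows through a link should not exceed the capacity of the link) can be checked with a few lines of Python. This is only an illustrative sketch: the function name `overloaded_links`, the representation of a flow as a `(rate, links)` pair, and the example numbers are assumptions made here and are not part of the described processes.

```python
from collections import defaultdict


def overloaded_links(flows, capacity):
    """Return the links whose summed flow rates exceed their capacity.

    flows: list of (rate, links) pairs; links is the set of link names the
           flow passes through (a hypothetical representation).
    capacity: dict mapping link name -> capacity, in the same units as rate.
    """
    load = defaultdict(float)
    for rate, links in flows:
        for link in links:
            load[link] += rate
    return sorted(link for link, total in load.items() if total > capacity[link])


# Hypothetical 2x2 example with every link limited to 100 units.
flows = [
    (60.0, {"in0", "out1"}),
    (50.0, {"in1", "out1"}),  # out1 now carries 60 + 50 = 110
    (30.0, {"in1", "out0"}),
]
capacity = {"in0": 100.0, "in1": 100.0, "out0": 100.0, "out1": 100.0}
print(overloaded_links(flows, capacity))  # ['out1']
```

In this example only output link out1 is over-subscribed, so the rates of the flows through it would have to be reduced.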

Fig. 1 illustrates a hybrid switch architecture. In 
hybrid architectures such as Fig. 1, flow control between the 
input 10 and output 11 links is used to ensure that the 
switch fabric memory is not overloaded. Flows originating 
from the same input and destined to the same output are 
assigned to one group referred to as a virtual link (VL).
The techniques described herein can also be used with
connections or sets of connections.

Fig. 1 shows an NxN switch where each input contains N
(N>1) VLs with a scheduling hierarchy that includes one link
scheduler, choosing among VLs, and N VL schedulers, choosing
among flows in a VL. The present approach to providing MQoS
support in a hybrid switch unites flow control and fabric
scheduling functions into a rate assignment that is valid for
a small fixed time interval, or frame. For every frame, each
VL requests some number of cell slots. Cell slots in this
case refer to data cells being transferred and correspond
to data flow bandwidth. A Virtual Link Rate Assignment
(VLRA) process grants cell slots to VLs according to
predetermined delay, throughput and fairness goals. The
input schedulers distribute the granted cell slots to
individual flows according to their QoS requirements.

Relatively high utilization, fairness and delay
guarantees are achieved through the use of guaranteed rates
and weighted fair queuing (WFQ) techniques. Since the present
approach uses small fabric queues, the fabric schedulers can
be simple FIFOs or priority queues, while the complexity is
shifted to the rate assignment process. By updating the rate
assignment over intervals that are longer than one cell slot,
communication bandwidth and scheduler complexity are reduced
from that required by an input buffered switch. Described
below are several VLRA processes.

1.0 Explicit Rate Assignment Process
In an input buffered switch, input and output links are
matched based on queue information sent from the inputs. For
a simple matching process that does not consider QoS, one bit
is used to indicate a backlogged/idle status for the
input/output pair. When the number of cell slots per frame,
T, is greater than one, more information than a
backlogged/idle status is needed. For a frame length of
small duration the actual queue size of the virtual links can
be used as state information. For longer frame lengths,
queue size prediction may be necessary to achieve acceptable
delay.

The explicit rate assignment (ERA) process keeps two
queue size states per VL: a committed traffic queue state
(CBR {Constant Bit Rate}, rt-VBR {RealTime-Variable Bit
Rate}, nrt-VBR {Non-RealTime-Variable Bit Rate}, and the
minimum rate of ABR {Available Bit Rate} and GFR {Guaranteed
Frame Rate}), and an uncommitted traffic queue state (UBR
{Unspecified Bit Rate}, ABR and GFR above the minimum rate).

Each VL has a guaranteed rate, measured in cell slots
per frame, to support delay and cell loss bounds for its
constituent flows. Fractional cell slots are accumulated
from frame to frame to support a large variety of rates. The
guaranteed rate is large enough for the VL scheduler to give
sufficient delay guarantees to real-time traffic and to
ensure average rate throughput for non-real-time committed
traffic. The ERA process operates in two phases.
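
As an illustration of how fractional cell slots might be carried from frame to frame, the sketch below keeps a running credit and releases only whole cell slots in each frame. The credit-counter mechanism, the function name `guaranteed_slots_per_frame`, and the example rate of 2.5 cell slots per frame are assumptions made for illustration; the text states only that fractional cell slots are accumulated.

```python
import math
from fractions import Fraction


def guaranteed_slots_per_frame(rate, frames):
    """Whole cell slots released in each frame for a fractional rate.

    rate: guaranteed rate in cell slots per frame (may be fractional).
    frames: number of consecutive frames to simulate.
    """
    credit = Fraction(0)
    grants = []
    for _ in range(frames):
        credit += Fraction(rate)      # earn this frame's guaranteed share
        whole = math.floor(credit)    # only whole cell slots can be used
        grants.append(whole)
        credit -= whole               # the fraction carries to the next frame
    return grants


# Hypothetical rate of 2.5 cell slots per frame over four frames.
print(guaranteed_slots_per_frame(Fraction(5, 2), 4))  # [2, 3, 2, 3]
```

Over the four frames the VL receives 10 cell slots, which matches the guaranteed rate of 2.5 cell slots per frame on average.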

1.1 First Phase

In the first phase, the ERA process allocates cell slots
to the committed traffic in a request/grant format based on
the guaranteed rates and the queue sizes. For the VL between
input i and output j (VLij), let $r_{ij}^{guar}$ represent the
guaranteed rate for the committed traffic and let
$q_{ij}^{C}[n]$ represent the amount of committed traffic
queued at the beginning of update interval n. Let
$F_{ij}^{C}[n]$ represent the fractional cell slots available
for committed traffic in interval n. Assuming that
$\sum_{i \text{ or } j} r_{ij}^{guar} \le T$ at both the input
and output links, the ERA process grants committed traffic
cell slots, $r_{ij}^{C}[n+1]$, for VLij in interval n+1 by

$$r_{ij}^{C}[n+1] = \min\{Q_{ij}^{C}[n],\; r_{ij}^{guar} + F_{ij}^{C}[n]\} \qquad (1)$$

where $Q_{ij}^{C}[n]$ is the requested number of cell slots.
$Q_{ij}^{C}[n]$ can be defined in many ways. One definition
that yields minimal complexity is
$Q_{ij}^{C}[n] = q_{ij}^{C}[n]$. For better throughput, the
committed traffic cell slots for frame n, $r_{ij}^{C}[n]$, may
be subtracted from $q_{ij}^{C}[n]$ to yield
$Q_{ij}^{C}[n] = q_{ij}^{C}[n] - r_{ij}^{C}[n]$. For even higher
throughput with additional computation, $Q_{ij}^{C}[n]$ may be
defined as
$Q_{ij}^{C}[n] = q_{ij}^{C}[n] - r_{ij}^{C}[n] - F_{ij}^{C}[n]$.
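
The first phase can be sketched in Python as follows. The sketch assumes that the grant is the smaller of the request and the guaranteed rate plus the accumulated fractional cell slots, and it clamps requests at zero; both choices, along with the function names and the `mode` strings, are assumptions made for illustration.

```python
def committed_request(queue, prev_grant=0.0, frac=0.0, mode="queue"):
    """Three request definitions from the text, simplest first."""
    if mode == "queue":               # Q = q
        return queue
    if mode == "minus_grant":         # Q = q - r
        return max(queue - prev_grant, 0)          # clamp is an assumption
    if mode == "minus_grant_frac":    # Q = q - r - F
        return max(queue - prev_grant - frac, 0)   # clamp is an assumption
    raise ValueError(f"unknown mode: {mode}")


def committed_grant(request, guaranteed, frac):
    """Assumed grant rule: never more than requested, never more than
    the guaranteed rate plus the accumulated fractional cell slots."""
    return min(request, guaranteed + frac)


# A VL with 10 cells queued, 4 cell slots granted for the current frame,
# a guaranteed rate of 4 cell slots per frame and 0.5 slot of credit.
request = committed_request(10, prev_grant=4, mode="minus_grant")
print(request, committed_grant(request, guaranteed=4, frac=0.5))  # 6 4.5
```

A lightly loaded VL (request below the guaranteed rate) is granted exactly what it asks for, which leaves the unused guaranteed slots available to the second phase.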

1.2 Second Phase

The second phase allocates cell slots by a request/grant 
format to the uncommitted traffic from the slots that remain 
after the first phase. Ideally, flow-level fairness and high 
throughput can be supported within a frame using a weighted 
maximum-minimum (max/min) approach. Several examples of
weighted max/min processes for allocating bandwidth to data
flows are described in detail in section 6 below. The
processes described in section 6 may be used in connection 
with the ERA process or independently thereof. 

Weighted max/min processes use a weighted max/min 
fairness approach to distributing network bandwidth. In this
approach each data flow is associated with a weight. A
flow's weight is related to the amount of bandwidth it needs. 
This, in turn, is a function of both a flow's delay and 
average rate requirements. These weights are used to 
distribute the bandwidth fairly among the competing flows. 

In one weighted max/min process for allocating 
bandwidth, each VL is granted cell slots in proportion to its
weight, $\phi_{ij}$, until its input link or its output link reaches
100% allocation or until it is granted enough cell slots to 
serve its uncommitted queue size. VLs meeting one of these 
conditions are "frozen". The remaining VLs receive 
additional cell slots until all VLs are frozen. 
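
The freezing procedure in the preceding paragraph can be sketched as a progressive-filling loop. This is a minimal illustration under stated assumptions, not the process of section 6: the function name `weighted_max_min`, the use of floating-point allocations, a single capacity shared by every link, and the tolerance `eps` are all choices made here.

```python
def weighted_max_min(weights, demands, capacity, eps=1e-9):
    """Progressive-filling sketch of weighted max/min over N x N VLs.

    weights[i][j], demands[i][j]: weight and uncommitted queue size of VLij.
    capacity: cell slots per frame left on every input and output link.
    """
    n = len(weights)
    alloc = [[0.0] * n for _ in range(n)]
    active = {(i, j) for i in range(n) for j in range(n)
              if weights[i][j] > 0 and demands[i][j] > 0}

    def in_used(i):
        return sum(alloc[i])

    def out_used(j):
        return sum(alloc[i][j] for i in range(n))

    while active:
        # Candidate step sizes: the first constraint to become tight wins.
        steps = [(demands[i][j] - alloc[i][j]) / weights[i][j]
                 for i, j in active]
        for k in range(n):
            row_w = sum(weights[k][j] for j in range(n) if (k, j) in active)
            col_w = sum(weights[i][k] for i in range(n) if (i, k) in active)
            if row_w > 0:
                steps.append((capacity - in_used(k)) / row_w)
            if col_w > 0:
                steps.append((capacity - out_used(k)) / col_w)
        step = max(min(steps), 0.0)
        for i, j in active:
            alloc[i][j] += weights[i][j] * step
        # Freeze VLs that are satisfied or sit on a fully allocated link.
        active = {(i, j) for i, j in active
                  if alloc[i][j] < demands[i][j] - eps
                  and in_used(i) < capacity - eps
                  and out_used(j) < capacity - eps}
    return alloc


# Hypothetical 2x2 switch, equal weights, 10 cell slots per link.
result = weighted_max_min([[1, 1], [1, 1]], [[8, 8], [8, 2]], capacity=10)
print(result)  # [[5.0, 5.0], [5.0, 2.0]]
```

In the example, VL11 freezes first because its queue is served, then input link 0 and output link 0 reach 100% allocation together and the three remaining VLs freeze at 5 cell slots each.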

A VL's weight for uncommitted traffic may be unrelated 
to its guaranteed rate. The rate should reflect the priority 
of the currently active uncommitted flows; e.g., the number 
of such flows in the VL. Thus, the weight provides fairness 
for uncommitted flows, but is not needed to guarantee delay 
for real-time traffic. One way to implement this approach 
involves up to N (N>1) iterations between input and output 
links with one input and output link reaching the frozen 
state on each iteration. This is described in greater detail 
below in section 6.1.

In practice, realizing a weighted max/min fair
allocation for each frame using N iterations may not always 
be the best solution. Instead, a one iteration request/grant 
process for each frame may be used. 

Briefly, on each input link, excess cell slots are
allocated in a weighted fair way until either the link is
100% allocated or all requested cell slots are satisfied.
Let $q_{ij}^{U}[n]$, $F_{ij}^{U}[n]$, $r_{ij}^{U}[n]$ and
$Q_{ij}^{U}[n]$ represent the queue size, the fractional cell
slots, the allocated rate, and the requested bandwidth,
respectively, for the uncommitted traffic of VLij. An
expansion factor $x_i$ is defined for input link i such that

$$r_{ij}^{U}(x_i, n+1) = \min\{\phi_{ij}\,x_i,\; Q_{ij}^{U}[n]\} \qquad (2)$$

Let $D_i[n+1] = T - \sum_j r_{ij}^{C}[n+1]$ represent the
leftover bandwidth for input link i. An expansion coefficient
$x_i^*$ is defined for input link i such that when
$\sum_j Q_{ij}^{U}[n] > D_i[n+1]$,

$$D_i[n+1] = \sum_j r_{ij}^{U}(x_i^*, n+1) \qquad (3)$$

Thus, $x_i^*$ represents the minimum expansion factor yielding
100% bandwidth allocation. Fig. 2 illustrates a typical plot
of $\sum_j r_{ij}^{U}(x, n+1)$ versus x.

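A closed form solution for $x_i^*$ in equation (3) does not
exist. A search technique may be used to find the solution.
For example, define

$$s_{ij}(x) = \begin{cases} \phi_{ij} & \text{if } \phi_{ij}\,x < Q_{ij}^{U}[n] \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

and

$$m_{ij}(x) = \begin{cases} 0 & \text{if } \phi_{ij}\,x < Q_{ij}^{U}[n] \\ Q_{ij}^{U}[n] & \text{otherwise} \end{cases} \qquad (5)$$

such that $x_i^*$ is the root of $g_i(x)$ in the equation

$$g_i(x) = D_i[n+1] - x \sum_j s_{ij}(x) - \sum_j m_{ij}(x) \qquad (6)$$

Now, the well-known Newton-Raphson method may be used to
iteratively find $x_i^*$ by

$$x_i^{k} = \frac{D_i[n+1] - \sum_j m_{ij}(x_i^{k-1})}{\sum_j s_{ij}(x_i^{k-1})} \qquad (7)$$

where the process is halted after a fixed number of
iterations, such as log(N) iterations.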
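
The iterative search for the expansion coefficient can be sketched as below. The sketch assumes that each VL's uncommitted grant is the smaller of its weight times x and its request, so that the Newton-Raphson update reduces to dividing the leftover bandwidth (minus the requests already satisfied) by the summed weights of the VLs still growing. The function names, the starting point x = 0, and the use of a base-2 logarithm for the iteration count are assumptions.

```python
import math


def expansion_factor(weights, requests, leftover, iterations):
    """Approximate x* for one link by a fixed number of Newton steps."""
    if sum(requests) <= leftover:
        return None  # every request fits; no expansion limit is needed
    x = 0.0
    for _ in range(iterations):
        # Slope: summed weights of VLs whose grant still grows with x.
        slope = sum(w for w, q in zip(weights, requests) if w * x < q)
        # Offset: summed requests of VLs that are already fully served.
        offset = sum(q for w, q in zip(weights, requests) if w * x >= q)
        if slope == 0:
            break
        x = (leftover - offset) / slope
    return x


def uncommitted_grants(weights, requests, x):
    """Grant each VL min(weight * x, request)."""
    return [min(w * x, q) for w, q in zip(weights, requests)]


# Hypothetical link with three VLs and 12 leftover cell slots.
weights, requests, leftover = [1, 1, 2], [2, 10, 10], 12
steps = max(1, math.ceil(math.log2(len(weights))))
x_star = expansion_factor(weights, requests, leftover, steps)
grants = uncommitted_grants(weights, requests, x_star)
print(round(x_star, 4), [round(g, 4) for g in grants])
```

Starting from x = 0 the iterates approach the root from below, so stopping early under-allocates the link slightly rather than over-allocating it. In the example the first step gives x = 3, the second gives x = 10/3, and the grants then sum to the 12 leftover cell slots.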

The uncommitted allocation is given by equation (2)
using the $x_i^*$ value obtained in equation (7). The
request/grant phases in a frame for a distributed
implementation are described with respect to Fig. 3.

First, the committed bandwidth requests, $Q_{ij}^{C}[n]$, are
sent to the output ports (outputs) of the switch. While this
information is being sent, the input ports (inputs) perform
both the committed and uncommitted bandwidth allocation. The
inputs send the resulting uncommitted granted allocations,
$r_{ij}^{U}[n+1]$, to an output. The output uses the
$r_{ij}^{U}[n+1]$ as the requested rates, or $Q_{ij}^{U}[n]$ in
equations (2) and (5), and sums over the inputs in equation
(7) to determine the $x_j^*$ values for the uncommitted
bandwidth. The resulting output link granted rates,
$r_{ij}^{U}[n+1]$, are returned to the inputs. Granting of the
committed traffic requests is calculated at the input
ports, and does not need to be returned by output ports.
Thus, the committed and uncommitted granted rates used in
frame n+1 are the $r_{ij}^{C}[n+1]$ values calculated at the
input ports and the $r_{ij}^{U}[n+1]$ values returned by the
output ports.
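
The one-iteration request/grant exchange can be sketched as an input pass followed by an output pass. The sketch assumes each link shares its leftover cell slots by granting min(weight × x, request) with x found by a few Newton steps; the helper names `fill` and `one_iteration_grant`, the iteration count of 8, and the example values are assumptions made for illustration.

```python
def fill(weights, requests, leftover, iterations=8):
    """Weighted fair share of `leftover` among the requests on one link."""
    if sum(requests) <= leftover:
        return list(requests)
    x = 0.0
    for _ in range(iterations):
        slope = sum(w for w, q in zip(weights, requests) if w * x < q)
        offset = sum(q for w, q in zip(weights, requests) if w * x >= q)
        if slope == 0:
            break
        x = (leftover - offset) / slope
    return [min(w * x, q) for w, q in zip(weights, requests)]


def one_iteration_grant(weights, requests, in_left, out_left):
    """Input pass, then output pass, for an N x N set of VLs."""
    n = len(requests)
    # Input pass: each input link shares its leftover slots among its VLs.
    at_input = [fill(weights[i], requests[i], in_left[i]) for i in range(n)]
    # Output pass: each output link treats the input grants as requests.
    granted = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col_w = [weights[i][j] for i in range(n)]
        col_q = [at_input[i][j] for i in range(n)]
        col = fill(col_w, col_q, out_left[j])
        for i in range(n):
            granted[i][j] = col[i]
    return granted


# Hypothetical 2x2 switch with 10 leftover cell slots on every link.
granted = one_iteration_grant([[1, 1], [1, 1]], [[8, 8], [8, 2]],
                              in_left=[10, 10], out_left=[10, 10])
print(granted)  # [[5.0, 5.0], [5.0, 2]]
```

Because the output pass can only reduce what the input pass granted, no output link is allocated more than its leftover cell slots, although an input link may end up with slots that a multi-iteration process would have reassigned.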

2.0 Fairness Between VLs But Not Flows

The complexity of the uncommitted bandwidth distribution
process is a function of N and the number of iterations in
equation (7). More than one iteration may not be optimal for
small T values or for large N values. To reduce the
complexity at the cost of sacrificing fairness between flows,
one may weight the uncommitted VLs by their requested rates
as described in "Real-Time Adaptive Bandwidth Allocation for
ATM Switches", M. Hegde, O. Schmid, H. Saidi, P. Min, Proc.
IEEE ICC '99, Vancouver, 1999. This eliminates the piece-wise
linear structure of the curve in Fig. 2 and allows
equation (3) to simplify to

$$r_{ij}^{U}[n+1] = D_i[n+1] \times \frac{Q_{ij}^{U}[n]}{\sum_k Q_{ik}^{U}[n]} \qquad (8)$$

The committed traffic is handled as given in equation (1).
If a VL is restricted to requesting no more than T cell slots
per frame, it can be shown that, over multiple frames, the
VLs receive bandwidth according to equations (2) and (7) with
all $\phi_{ij}$ equal and $Q_{ij}^{U}[n]$ equal to the VL's
average rate expressed in cells/frame. In this sense,
fairness is achieved by equally dividing the extra bandwidth
among the VLs requesting it. Although fairness is achieved
between VLs, it is not achieved between flows because VLs may
contain different numbers of constituent flows.
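
Weighting the VLs by their requested rates reduces the allocation on a link to a single proportional split, as the sketch below shows. The function name `proportional_grants`, the pass-through case when all requests fit, and the example numbers are assumptions made for illustration.

```python
def proportional_grants(requests, leftover):
    """Split `leftover` cell slots in proportion to the requests."""
    total = sum(requests)
    if total <= leftover:
        return list(requests)  # assumed: everything fits, grant as asked
    return [leftover * q / total for q in requests]


# Hypothetical link: three VLs request 2, 10 and 10 cell slots and
# 11 cell slots are left after the committed traffic is served.
print(proportional_grants([2, 10, 10], 11))  # [1.0, 5.0, 5.0]
```

Every VL receives the same fraction (here one half) of what it requested, regardless of how many flows it contains, which is why fairness holds between VLs but not between flows.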

3.0 Reducing Delay for Committed Traffic (Immediate Grant)

Arriving committed traffic may wait as much as 2T before
receiving cell slots from the ERA process of section 1.0.
Fig. 4 demonstrates this latency. For large T values, it may
be desirable to reduce this transition delay. The immediate
grant process, explained here, reduces the transition delay
by pre-allocating the guaranteed rate to the committed
traffic at the expense of larger buffer requirements in the
fabric. For the committed traffic, set
$Q_{ij}^{C}[n] = r_{ij}^{guar}$ in each frame. Thus, the
granted rate for the committed traffic is simply the
guaranteed rate, $r_{ij}^{guar}$. For the uncommitted traffic,
however, let $Q_{ij}^{C}[n]$ equal the amount of committed
traffic sent in the previous frame rather than the amount
requested for the next frame, and use this new $Q_{ij}^{C}[n]$
to calculate $r_{ij}^{C}[n+1]$ in equation (1) and, thus,
$D_i[n+1]$ in the equation
$D_i[n+1] = T - \sum_j r_{ij}^{C}[n+1]$. This $D_i[n+1]$ value
can be used with the processes of sections 1.2 and 2.0 when
calculating the uncommitted bandwidth allocation. By virtue
of this process, it is possible for the output links to
receive more than T cells in a frame.
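
The leftover calculation under immediate grant can be illustrated as follows. The sketch assumes that the committed usage charged to each VL is the number of committed cells it sent in the previous frame, capped at its guaranteed rate, and it ignores fractional cell slots; the function name and the example values are assumptions.

```python
def immediate_grant_leftover(frame_slots, guaranteed, sent_prev):
    """Leftover cell slots on one link under immediate grant.

    guaranteed[j]: guaranteed rate of VL j in cell slots per frame.
    sent_prev[j]: committed cells VL j actually sent in the previous frame.
    """
    charged = [min(sent, rate) for sent, rate in zip(sent_prev, guaranteed)]
    return frame_slots - sum(charged)


# Hypothetical link with T = 128 and three VLs guaranteed 40, 40 and 20.
T = 128
guaranteed = [40, 40, 20]
leftover = immediate_grant_leftover(T, guaranteed, sent_prev=[10, 40, 0])
print(leftover)                    # 78 slots offered to uncommitted traffic
print(sum(guaranteed) + leftover)  # 178 cells could be sent, more than T
```

The second figure shows why the fabric needs larger buffers: the committed traffic may still use its full guaranteed rate of 100 cell slots in the frame in which 78 cell slots were handed to uncommitted traffic.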

The guaranteed rate creates a constant rate channel, 
with bounded jitter, between an input and output link for the 
committed traffic. This channel is available whenever 
committed traffic is backlogged. Thus, with a WFQ VL 
scheduler, for example, and the appropriate accounting for 
switch fabric delays, worst case delay guarantees can be made 
for committed traffic using the techniques presented in "A 
Generalized Processor Sharing Approach to Flow Control in 
Integrated Services Networks", A. Parekh, Ph.D. dissertation, 
Massachusetts Institute of Technology, February 1992.

4.0 Queue Size Requirements in Fabric

Queue sizes in the fabric produced by the processes
described in sections 1.0 and 2.0 are bounded for any input
scheduler obeying the allocated cell slots per frame. Assume
that, for every frame, a VLRA process fills per-VL token
queues with the frame's allocation. In each cell slot, the
input scheduler can choose a given VL only if both the cell
and token queues are backlogged. In the worst case, the
input schedulers send cells such that all T cells for an
output link arrive at the end of one frame followed by all T
cells for the same output link arriving at the beginning of
the next frame. Since only N cells can arrive in a single
slot, these arrivals are actually spread out over
$2\lceil T/N \rceil$ cell slots. Therefore, because
$2\lceil T/N \rceil - 1$ departures occur over this interval,
the queue depths produced by the worst case input scheduling
for the processes in sections 1.0 and 2.0 are

$$2T - (2\lceil T/N \rceil - 1) \qquad (9)$$

In the immediate grant process of section 3.0, the two
frame delay shown in Fig. 4 means that the uncommitted rate
allocated for the $k^{th}$ update interval is based on the
number of committed cells that were sent during the
$(k-2)^{th}$ update interval. In the worst case, when
$\sum_j r_{ij}^{guar} = T$, the immediate grant reservation of
committed traffic can produce as many as T committed and T
uncommitted cells at an output queue during a given update
interval. This may occur two update intervals in a row. If
2T cells arrive to the output queue at the end of one
interval and 2T cells arrive at the beginning of the next
interval, with no more than N cells arriving in the same
slot, the worst-case queue depth is given by

$$4T - (2\lceil 2T/N \rceil - 1) \qquad (10)$$
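
Both worst-case bounds follow the same pattern: two back-to-back bursts arrive at an output queue at no more than N cells per slot while one cell departs per slot. The sketch below evaluates the bounds on that basis; the function name and the `immediate_grant` flag are assumptions made for illustration.

```python
import math


def worst_case_depth(T, N, immediate_grant=False):
    """Worst-case fabric queue depth for frame length T and N inputs."""
    burst = 2 * T if immediate_grant else T   # cells in each of the two bursts
    arrivals = 2 * burst
    # The two bursts occupy 2 * ceil(burst / N) slots; one cell departs in
    # every slot after the first.
    departures = 2 * math.ceil(burst / N) - 1
    return arrivals - departures


print(worst_case_depth(T=128, N=16))                        # 241
print(worst_case_depth(T=128, N=16, immediate_grant=True))  # 481
```

For a 16 x 16 switch with T = 128 the request/grant bound is 241 cells and the immediate grant bound is 481 cells.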

Notice that the immediate grant process eliminates the 2T 
delay shown in Fig. 4, but may cause an extra 2T delay in the
fabric. Priority queuing in the fabric can eliminate the 
extra 2T worst-case delay. Decorrelating inputs that send 
cells to the same output can produce significantly lower 
worst-case queue bounds, and proper spacing of the cell slots 
within a frame can reduce burstiness. 

5.0 Simulation Results

Simulation results for the processes of sections 1.0 to 
4.0 are presented here to indicate their expected 
performance. Three traffic types, CBR, VBR and UBR are 
passed through a 16 x 16 switch with T=128 cell slots. The 
VBR and UBR sources are bursty with PCR equal to 10 times the
average rate and a maximum burst size of 100 cells. For all
loads, every input and output link contains 1/3 CBR, 1/3 VBR
and 1/3 UBR; however, individual VLs contain a variety of traffic
mixes. The loading is increased by adding connections
according to the traffic percentages. The switch fabric
schedulers serve committed traffic with strict priority over
uncommitted traffic.

Docket No.: 06269/020001

Fig. 5 illustrates the average delays for all three
traffic types and the worst case delays for CBR and VBR using
immediate grant accounting for committed traffic and a one-
iteration, weighted fair allocation for uncommitted traffic.
CBR is allocated enough bandwidth to allow a WFQ VL scheduler
to always deliver a cell within 5 frames (640 cell slots or
113 µs at OC-48). VBR traffic is given a large guaranteed
rate, enough to yield a worst case delay guarantee of 1.5 ms
at 95% loading on OC-48 links with a WFQ VL scheduler.

As shown in Fig. 5, the CBR and VBR cells are completely 
isolated from the UBR traffic. The maximum CBR delay 
decreases as more connections are added (statistical 
multiplexing gain), while the maximum VBR delay increases due 
to less guaranteed bandwidth per connection. The unused 
guaranteed rate is effectively captured and distributed to 
the UBR traffic. The process produces acceptable average UBR 
delays at 92% loading for this scenario. 

Fig. 6 illustrates the same graphs for the request/grant 
process for allocating committed traffic using a one- 
iteration, weighted fair allocation for uncommitted traffic. 
The worst case CBR delay is approximately 2T (256 cell slots) 
worse than that of the immediate grant process and the 
average delay is 1.5T worse. The UBR performs approximately 
the same as in Fig. 5, indicating that the advantage of 
sending extra traffic into the fabric is offset by larger 
delays from the fabric scheduler. A sixteen-iteration, 
weighted fair process for the uncommitted traffic was run on 
the same traffic scenario. No significant difference in 
average UBR delays occurred until 90% loading. The average 
UBR delay at 98% loading for the 16-iteration process was 238 
μs as compared to 2.4 ms for the 1-iteration process. 

For N = 16 and T = 128, the worst case queue depths 
given by equations (9) and (10) are 241 and 481, 
respectively. A round robin link scheduler may be employed 
with priority of committed traffic over uncommitted traffic. 
The worst case queue sizes yielded by the simulations were 86 
cells for the immediate grant process and 43 cells for the 
request/grant process of committed bandwidth allocation. 

Fig. 7 gives a per-flow throughput comparison for the 
uncommitted traffic achieved by explicit weights versus 
weighting by equation (8). We set the sum of the UBR traffic 
going to one output link at 200% loading with some VLs 
containing twice as many flows as others. Fig. 7 indicates 
the ratio of arrivals to departures for each flow. It 
demonstrates that explicit weights, based on the number of 
connections, produce approximately fair bandwidth allocation 
between flows, whereas weighting by desire allows some flows 
to capture more bandwidth than others. The flows receiving 
more bandwidth were members of VLs with a smaller number of 
constituent flows. 

6.0 Weighted Min/Max Network Rate Assignment 

The following weighted max/min processes assign rates to 
the various data flows as fairly as possible (e.g., in 
proportion to their weights), and improve on the "perfect 
fairness" solution by allowing some connections to transmit 
data at rates above their "fair" allocation. This increases 
the network's throughput without reducing the data flows 
below their "fair" rates. Maximum rates for data flows may 
also be specified for flows that do not need bandwidth beyond 
predetermined limits. The sums of the allowed data rates at 
some of the links are not constrained to the capacity of 
those links. This gives the links more freedom to choose 
which flows to service. 

These processes prevent data from overwhelming certain 
links of a packet switched network while increasing the 
network throughput and satisfying delay QoS commitments. 
When applied correctly, these processes can be used to 
prevent queue overflow in a core of a switch or other network 
device while allowing the throughput of the switch to remain 
relatively high. If traffic arrives at an output link too 
quickly, the queue lengths quickly grow beyond the buffer 
capacity and information is lost. These processes keep the 
throughput of the system high by dynamically redistributing 
bandwidth to the data flows that are backlogged. Thus, 
bandwidth is given to flows that need it, rather than to 
flows that are idle or need only a small amount of bandwidth. 
This dynamic redistribution of switch capacity allows the 
system to more easily meet varying QoS requirements. 

To summarize, in the first phase of one weighted max/min 
process, all flows start with a rate (i.e., bandwidth) of 
zero. Their rates are slowly increased in proportion to 
their weights. Thus, a flow with a weight of 0.6 will get 
twice as much bandwidth as a flow with a weight of 0.3. The 
increasing continues until one of the links in the network 
reaches capacity. At this point, the rates of all of the 
flows passing through the saturated link are frozen. 

All of the non-frozen flows can then be proportionally 
increased again until another link saturates. This pattern 
continues until all of the flows are frozen by a saturated 
link. This approach yields the weighted max/min solution to 
the rate assignment problem. This solution has the property 
that in order to increase any rate above its max/min rate, 
the rate of at least one data flow with an equal or lower 
rate to weight ratio must be decreased. 

Thus, the max/min solution is the solution that 
maximizes the minimum rate to weight ratios. It should be 
noted, however, that on some links, the sum of the allocated 
rates may be less than capacity. This can happen if all of 
the flows passing through the link are bottlenecked 
elsewhere. 
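
The increase-and-freeze procedure summarized above can be sketched in a few lines of Python. This is a minimal illustration rather than the claimed implementation: the function name, the dictionary layout, and the two-link example network are assumptions introduced here, and exact fractions are used so that saturation can be tested with equality.

```python
from fractions import Fraction


def weighted_max_min(weights, routes, capacity):
    """Weighted max/min rates by progressive filling.

    weights:  {flow: weight}
    routes:   {flow: list of links the flow crosses}
    capacity: {link: capacity}
    All names and structures here are illustrative assumptions.
    """
    rate = {f: Fraction(0) for f in weights}
    frozen = set()
    while len(frozen) < len(weights):
        # Smallest growth of the common factor that saturates some link.
        step = None
        for link, cap in capacity.items():
            on_link = [f for f in weights if link in routes[f]]
            growing = [f for f in on_link if f not in frozen]
            if not growing:
                continue
            slack = Fraction(cap) - sum(rate[f] for f in on_link)
            link_step = slack / sum(Fraction(weights[f]) for f in growing)
            if step is None or link_step < step:
                step = link_step
        if step is None:
            break  # remaining flows cross no capacity-limited link
        # Grow every non-frozen flow in proportion to its weight.
        for f in weights:
            if f not in frozen:
                rate[f] += step * Fraction(weights[f])
        # Freeze all flows that cross a link that is now saturated.
        for link, cap in capacity.items():
            on_link = [f for f in weights if link in routes[f]]
            if sum(rate[f] for f in on_link) == cap:
                frozen.update(on_link)
    return rate


# Hypothetical network: flow x uses link L1 only, flow y uses L1 and L2.
weights = {"x": Fraction(6, 10), "y": Fraction(3, 10)}
routes = {"x": ["L1"], "y": ["L1", "L2"]}
capacity = {"L1": 1, "L2": 1}
rates = weighted_max_min(weights, routes, capacity)
print(rates)
```

In this hypothetical example flow x (weight 0.6) receives twice the rate of flow y (weight 0.3), link L1 saturates, and link L2 is left below capacity because its only flow is bottlenecked at L1.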

Because of practical constraints the max/min rates 
cannot be updated instantaneously when the traffic conditions 
change. This means that after the rates are determined they 
must be used for some period of time. Certain variations on 
the basic process may, therefore, be useful. For instance, 
depending on the traffic mix, it may be advantageous to give 
each flow a minimum rate and then give out the extra 
bandwidth based on the flow weights. Some flows may have a 
maximum bandwidth requirement. In these cases, a flow would 
not use any bandwidth beyond this amount. Therefore, the 
flow is frozen if it reaches its maximum rate. The weighted 
max/min process can be modified to take these minimum and 
maximum rates into account. 

Weighted max/min concepts are also extended in order to 
increase the throughput of the network. This extension is 
based on the observation that, in many cases, the actual 
traffic patterns in the network differ from the predicted 
patterns upon which the rate assignments were based. In 
these situations, it is useful to give the links ranges for 
each data flow rate. Then, if one flow is not using its 
bandwidth, the link could give it to another flow. Using 
this approach, it is possible to overbook some links. If 
more data is coming in to a link than can be sent out, the 
traffic is either transmitted or dropped according to well- 
defined and agreed-upon rules. Some subset of links can be 
chosen to always have the potential for full utilization but 
not for overloading. If, after allocating rates using the 
weighted max/min rules, the links are not fully committed, 
the extra bandwidth is distributed to the flows passing 
through the link, even if this causes other links to be 
overbooked. This serves to maximize the throughput of the 
network while protecting this subset of links from overflow. 

In an input buffered switch, the output ports would not 
be overloaded while the input ports could be overbooked. 
This would allow a scheduler at the input port to decide how 
fast to serve each flow while filling the input port's 
capacity. Since this scheduler has access to the current 
make-up of the queued traffic (non-real-time vs. real-time, 
which flow is bursting, etc.) it can make a more appropriate 
decision on rates than the original rate assignment. The 
output ports, however, are protected from overload. 

Fig. 8 shows an example of a switch with max/min fair 
rates, input port overbooking, and output ports always 
booked to full capacity. In this example, the capacities of 
all of the input and output ports are assumed to be unity 
and none of the flows have an associated minimum or maximum 
rate. Flows AI and AII both have a weight of 0.4. Under 
normal max/min fairness each of these flows would have 
received a rate of 0.5, making the sum of the allowed rates 
through input port A equal to its capacity. This would 
leave both output ports I and II underbooked, each with a 
spare capacity of 0.5. If the traffic arriving at port A was 
exclusively headed for port I, it would have to be sent on 
with a rate of 0.5. With the overbooking shown here, this 
data could be sent on at a rate of 1, increasing the 
throughput of the switch. 

Port III demonstrates one aspect of max/min fairness. 
Since port III is the first port to bottleneck, the rates 
assigned to each flow are proportional to their weights. 
Since all of their weights are the same, all of the rates 
are the same. Now examine port C. With $r_{CIII}$ frozen at 1/3, 
$r_{CIV}$ is free to expand up to a maximum of 2/3. Because port 
IV is not heavily loaded, that happens. Thus, although both 
flows passing through port C have the same weight, the flow 
passing through the heavily loaded output port only receives 
a rate of 1/3 while the flow passing through the less 
heavily loaded output port receives a rate of 2/3. 

6.1 Generalized Processor Sharing 

Generalized Processor Sharing (GPS), which is a type of 
fair queueing, is a service method used to multiplex traffic 
from a number of sources onto a single line. GPS deals with 
multiplexers or networks of multiplexers. In some switch 
architectures, the input and output nodes act as 
multiplexers, but there is very little buffering before the 
output nodes. Accordingly, care must be taken to limit the 
amount of traffic coming to each output node. In fact, it is 
advantageous to think of the switch as an entire unit rather 
than as a collection of independent nodes. Some of the ideas 
of GPS may be used in this process. These are described in 
the multi-stage (e.g., two stage) GPS section, where Extended 
GPS (EGPS) is introduced. GPS is referred to herein as 
single node GPS to differentiate it from multi-stage GPS. 

In prior art GPS, rate recalculations are performed 
instantaneously whenever needed. In the present process, 
this is not necessarily the case. How often the bandwidth 
allocations can be re-determined will have an effect on how 
complicated the bandwidth allocation process is, what 
information is needed, and how well the system performs. 

6.1.1. Single Node (Stage) GPS 

Consider the single node (normal) GPS server shown in 
Fig. 9. Several connections, each queued separately, are 
being multiplexed onto a line with a capacity of C. At a 
given time, a connection may be either idle, if no cells are 
waiting to be served, or backlogged, if there are one or more 
cells waiting. Flows that are idle receive no bandwidth, 
since they have no cells to send. The server's bandwidth is 
divided between the backlogged flows in proportion to a set 
of weights. These weights are key to one of the aspects of 
GPS, its notion of fairness. 

Each connection is assigned a weight $\phi_i$. The GPS notion 
of fairness is a weighted fair bandwidth distribution between 
the backlogged connections. If two connections, i and j, are 
backlogged and are assigned rates $r_i$ and $r_j$ respectively, then 
the rates are fair if the rates are in proportion to the 
ratio of their weights. That is, 

$$\frac{r_i}{r_j} = \frac{\phi_i}{\phi_j}. \qquad (11)$$

Equation (11) must hold true for any number of backlogged 
connections. Any number of rate sets will satisfy this 
constraint. One set of rates is the set that sums up to the 
server's capacity. For example, letting B be the set of 
connections that are backlogged (and therefore will receive 
non-zero rates), the rate of a connection i in B is 

$$r_i = \frac{\phi_i}{\sum_{j \in B} \phi_j}\, C, \qquad (12)$$

where $b = C / \sum_{j \in B} \phi_j$ is the expansion factor. Any non-zero rate 
can be determined as $r_i = b\,\phi_i$. This satisfies the GPS notion 
of fairness since 

$$\frac{r_i}{r_j} = \frac{b\,\phi_i}{b\,\phi_j} = \frac{\phi_i}{\phi_j}. \qquad (13)$$

This also forces the sum of the rates to be equal to the 
server capacity. To see this, write 

$$\sum_{j \in B} r_j = \sum_{j \in B} b\,\phi_j = b \sum_{j \in B} \phi_j = \frac{C}{\sum_{j \in B} \phi_j} \sum_{j \in B} \phi_j = C. \qquad (14)$$

Finally, note that the rates are re-determined any time the 
set B changes, which could be quite often. 
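
As an illustration of the single node rate rule, the following Python sketch computes the expansion factor b and the resulting rates for a given backlogged set. The function name, the connection labels, and the example weights are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def gps_rates(weights, backlogged, capacity=1):
    """Single node GPS: rate = b * weight for backlogged connections.

    weights:    {connection: weight}
    backlogged: set of connections with cells waiting (the set B)
    Idle connections receive a rate of zero.
    """
    rates = {i: Fraction(0) for i in weights}
    total = sum(Fraction(weights[i]) for i in backlogged)
    if total == 0:
        return rates  # nothing is backlogged
    b = Fraction(capacity) / total  # expansion factor
    for i in backlogged:
        rates[i] = b * Fraction(weights[i])
    return rates


# Hypothetical weights for three connections sharing a unit-capacity line.
weights = {"c1": Fraction(1, 2), "c2": Fraction(1, 4), "c3": Fraction(1, 4)}
all_busy = gps_rates(weights, {"c1", "c2", "c3"})
two_busy = gps_rates(weights, {"c1", "c2"})
print(all_busy)
print(two_busy)
```

When c3 goes idle, its share is redistributed to c1 and c2 in proportion to their weights, and the rates still sum to the server capacity.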

One way of thinking about what is happening is to give 
each backlogged connection a rate $r_i = \epsilon\,\phi_i$. By starting $\epsilon$ at 
zero and slowly increasing its value, the bandwidth assigned 
to each flow slowly expands. At all times, the bandwidths 
are GPS fair because each one is proportional to its weight 
$\phi_i$. $\epsilon$ can be increased until the sum of the rates reaches 
the capacity of the server. The $\epsilon$ for which $\sum_{i \in B} r_i = C$ is b. 

The next issue is how to calculate the weights, $\phi$. For 
real-time connections, one way to calculate the weights is to 
consider the minimum service rate necessary for a connection to 
meet its QoS contract. For instance, if a connection is 
leaky bucket constrained (maximum burst of $\sigma$, average rate of 
$\rho$) and all cells need to be served in $\tau$ seconds, the 
connection needs a minimum rate of $\sigma/\tau$ no matter what other 
traffic may be at the server. In GPS, a connection will 
receive its smallest rate when every connection is 
backlogged. Thus, the minimum rate a connection receives is 

$$r_i^{\min} = \frac{\phi_i}{\sum_{j=1}^{N} \phi_j}\, C, \qquad (15)$$

where N is the number of connections. By constraining the 
sum of the weights to be one or less, 

$$r_i^{\min} \ge \phi_i C. \qquad (16)$$

So, in order to accept the (N+1)-th connection, calculate 
$\phi_{N+1} = (\sigma_{N+1}/\tau_{N+1})/C$. If $\sum_{j=1}^{N+1} \phi_j \le 1$, the connection can be 
admitted. Otherwise, the connection should be rejected. 
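
The admission test can be sketched as follows. This is an illustrative Python fragment, not the claimed implementation: it assumes that the weight of a real-time connection is its minimum rate (burst size divided by delay bound) normalized by the link capacity, and that a connection is admitted when the sum of the weights stays at one or less. The function names and the numeric values are hypothetical.

```python
from fractions import Fraction


def required_weight(sigma, tau, capacity):
    """Weight needed to serve a burst of sigma cells within tau seconds."""
    return (Fraction(sigma) / Fraction(tau)) / Fraction(capacity)


def admit(current_weights, sigma, tau, capacity):
    """Return (accepted, weight) for a candidate real-time connection."""
    phi_new = required_weight(sigma, tau, capacity)
    accepted = sum(current_weights, Fraction(0)) + phi_new <= 1
    return accepted, phi_new


# Hypothetical server of capacity 100 cells/s with two admitted connections.
current = [Fraction(1, 2), Fraction(1, 4)]
ok_small, phi_small = admit(current, sigma=10, tau=Fraction(1, 2), capacity=100)
ok_large, phi_large = admit(current, sigma=20, tau=Fraction(1, 2), capacity=100)
print(ok_small, phi_small)
print(ok_large, phi_large)
```

The first candidate needs a weight of 1/5 and fits; the second needs 2/5, which would push the sum of the weights above one, so it is rejected.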

For non-real-time connections the weight $\phi$ is less 
important. It determines the amount of bandwidth a flow 
receives during periods when many connections are backlogged. 
Non-real-time connections can be given little or no bandwidth 
for short periods of time (when there is delay sensitive 
traffic to send) as long as they eventually receive enough 
bandwidth to support their long-term average rates, $\rho$. 

As long as the server is not overbooked, that is, as 
long as $\sum_j \rho_j < C$, every connection will receive enough 
service to support its long-term average rate. It should be 
noted that the bandwidth a connection (real-time or non-real- 
time) receives will not be constant. In fact, to receive 50% 
of the server's bandwidth, a connection may receive a very 
small percent of the bandwidth while other traffic is present 
and then receive 100% of the bandwidth for a period of time. 
There is no peak rate constraint on the output of the server. 

6.1.2. Multi-stage GPS 

6.1.2.1 GPS Fair Rate Allocation 
Consider the two input, two output switch shown in Fig. 
10. Four data paths or flows exist, AI, AII, BI, and BII 
with rates $r_{AI}$, $r_{AII}$, $r_{BI}$, and $r_{BII}$ respectively. To assign a 
set of rates to the flows in a switch, the sums of all of the 
rates through each of the nodes (input and output) must be 
less than or equal to the capacity of each of the nodes. 
That is, for a node Y, the sum of the rates of the flows 
through Y must be less than or equal to the capacity of the 
node Y. For the switch to be stable, this should be true for 
each node in the switch. 

In the architecture of concern here, if cells are 
arriving at an input node at a rate higher than the rate the 
flow has been assigned, the cells are queued at the input 
node. The output nodes are assumed to have minimal queuing, 
as noted above. This could be an NxN switch. 

One problem is how to pick the input and output flow 
rates for the cells. One way to pick the rates is to extend 
the ideas of GPS to this two stage model, where "input" is 
one stage and "output" is the other stage. Assume that each 
flow has a weight associated with it, $\phi_x$. Note that each 
flow gets one weight, not a different weight at an input node 
and another at an output node. 

If a node in the switch is "GPS fair" (also called 
"weighted fair"), then the rates of all of the flows passing 
through this node are in proportion to the weights of the 
flows. If all of the rates in a switch are in proportion to 
the assigned weights, then the entire switch is GPS fair. 

With two levels, there should be GPS fairness among all 
of the flows. This should be true even though many flows do 
not share a node. Thus, to be GPS fair, $r_{AI}/r_{BII} = \phi_{AI}/\phi_{BII}$. To 
determine the maximum rates that are still GPS fair, the 
bandwidth assignments of each of the flows are slowly 
increased as in the single node case. Assign each backlogged 
flow a rate $r_x = \epsilon\,\phi_x$. Starting at zero, slowly increase $\epsilon$. 
As long as $\sum r_x < C_Y$ for all nodes, keep increasing $\epsilon$ (where 
the sum is over the flows through any node Y and $C_Y$ is the 
capacity of that specific node). Eventually one of the nodes 
will saturate. That is, for one (or more) nodes, $\epsilon$ will get 
large enough such that $\sum r_x = C_Y$ for some node Y. 

At this point, stop increasing $\epsilon$ and let $b_1$ be this 
specific $\epsilon$ value. The node(s) that saturate(s) is (are) 
known as the bottleneck node(s). These rates, using $b_1$ as 
the expansion factor, give the switch the largest throughput 
possible while still being GPS fair. Any increase in the 
rates through the bottleneck node will cause the bottleneck 
node to be unstable, but increasing the rate of some 
connection while not increasing the rates of the flows 
through the bottleneck node will cause the rates of the flows 
to no longer be GPS fair. One way to calculate $b_1$, the first 
breakpoint, is to consider each node in isolation and 
calculate b for each one. The smallest b is $b_1$. 
As an example, consider the 2x2 switch shown in Fig. 11. 
Assume all of the flows are backlogged so that they receive 
bandwidth proportional to their weights $\phi_{AI} = 1/4$, $\phi_{AII} = 1/8$, 
$\phi_{BI} = 1/4$, and $\phi_{BII} = 1/16$. Assume the output capacity of each 
node is 1. For these values, $b_1$ is 2, which makes $r_{AI} = 1/2$, 
$r_{AII} = 1/4$, $r_{BI} = 1/2$, and $r_{BII} = 1/8$. The bottleneck node is node I. 
The total throughputs of each of the nodes are 

$$\begin{aligned}
\text{Node A:}\quad & r_{AI} + r_{AII} = 1/2 + 1/4 = 3/4 \\
\text{Node B:}\quad & r_{BI} + r_{BII} = 1/2 + 1/8 = 5/8 \\
\text{Node I:}\quad & r_{AI} + r_{BI} = 1/2 + 1/2 = 1 \\
\text{Node II:}\quad & r_{AII} + r_{BII} = 1/4 + 1/8 = 3/8.
\end{aligned} \qquad (17)$$

6.1.2.2 Extended GPS Fair Rate Allocation 
In the example of Fig. 11, three nodes have leftover 
capacity. While the bandwidths of the flows passing through 
node I cannot be increased, it is certainly possible to 
increase the bandwidth between node A and node II and between 
node B and node II. In this case, the rates would no longer 
be GPS fair. Increasing the rates through node II would not 
hurt any of the flows through node I, however. This leads to 
a new strategy for rate allocation: be GPS fair for as long as 
possible and then parcel out the remaining bandwidth in a 
manner that does not harm any connections. That is, first 
expand all the rates to $b_1\phi_x$, then increase various rates 
while never decreasing any rate below $b_1\phi_x$. This approach 
will make every rate $b_1\phi_x$ at a minimum, while preserving the 
potential to make some rates higher. 
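
The first breakpoint and the leftover capacity in the Fig. 11 example can be checked numerically. The Python sketch below assumes the weights quoted for Fig. 11 (1/4, 1/8, 1/4, and 1/16 for flows AI, AII, BI, and BII) and unit node capacities; the variable names and the dictionary layout are illustrative only.

```python
from fractions import Fraction

# Weights quoted for the Fig. 11 example; flow "AI" runs from input A to output I.
weights = {
    "AI": Fraction(1, 4),
    "AII": Fraction(1, 8),
    "BI": Fraction(1, 4),
    "BII": Fraction(1, 16),
}
# Flows passing through each node (two inputs A, B and two outputs I, II).
nodes = {
    "A": ["AI", "AII"],
    "B": ["BI", "BII"],
    "I": ["AI", "BI"],
    "II": ["AII", "BII"],
}
capacity = Fraction(1)

# Expansion factor of each node considered in isolation.
b_per_node = {n: capacity / sum(weights[f] for f in flows) for n, flows in nodes.items()}
# The smallest one is the first breakpoint.
b_first = min(b_per_node.values())
bottleneck = [n for n, b in b_per_node.items() if b == b_first]

# GPS fair rates and the load they place on each node.
rates = {f: b_first * w for f, w in weights.items()}
loads = {n: sum(rates[f] for f in flows) for n, flows in nodes.items()}
leftover = {n: capacity - load for n, load in loads.items()}
print(b_first, bottleneck, loads, leftover)
```

Only node I is fully loaded at the first breakpoint; nodes A, B, and II retain 1/4, 3/8, and 5/8 of their capacity, which is the bandwidth the extended allocation goes on to distribute.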

The question now becomes how the extra bandwidth should 
be allocated. Any allocation method may be used. One method 
continues to distribute the bandwidth in proportion to the 
$\phi$'s. Conceptually, this means freezing the rates of the 
flows through the bottleneck node(s) at $b_1\phi_x$ and then slowly 
increasing $\epsilon$ beyond $b_1$ so that the non-frozen rates can 
increase. $\epsilon$ should be increased until the next node(s) 
bottlenecks. Call this value of $\epsilon$ $b_2$. By freezing the rates 
of the flows through these new bottleneck nodes (except for 
the flows that have already been frozen), $\epsilon$ can again be 
increased until another node(s) saturates. This pattern can 
be repeated until all of the flows are frozen. The aim of 
this allocation process is to be fair to as many flows for as 
long as possible. For a given node Y, all of the connections 
that were first constrained by node Y will have rates that 
are GPS fair relative to each other. 

It should be noted that while $b_1$ is equal to the 
smallest b value, $b_2$ is not necessarily the second smallest b 
value. The second bottleneck node may not even be the node 
with the second smallest b value. This is because the 
original b was determined assuming all of the nodes were 
independent. As soon as one node bottlenecks, and the 
bandwidths of the flows passing through it are frozen, the 
other nodes will saturate at a potentially larger $\epsilon$ value. 
If a node has a frozen flow passing through it, the rates of 
the other flows will grow larger than they otherwise would 
have been able to grow. Thus, the node will saturate later 
than it would if it were independent of the other nodes. 

Returning to the example of Fig. 11, the rates $r_{AI}$ and 
$r_{BI}$ need to be frozen but $r_{AII}$ and $r_{BII}$ can be increased. The 
next node to bottleneck is A when $r_{AII} = 1/2$. At this point, 
$r_{BII} = 1/4$. See Fig. 12 for these rates. This still leaves 
$\sum_{B} r_x = \sum_{II} r_x = 3/4$. Since $r_{BII}$ is not frozen, it can be 
increased until node B and/or node II saturate. In fact, 
they saturate at the same point, when $r_{BII} = 1/2$. This makes 
all of the rates 1/2 and all of the nodes fully loaded. Fig. 
12 shows these rates also. 
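
The remaining breakpoints of this example can be worked out explicitly. The derivation below is a sketch that assumes the weights quoted for Fig. 11 (1/4 for flows AI and BI, 1/8 for AII, 1/16 for BII) and unit node capacities. The symbol $\epsilon_Y$ denotes the value of $\epsilon$ at which node Y would saturate; it and the labels $b_2$ and $b_3$ for the second and third breakpoints are notation introduced here for illustration.

```latex
% Step 1: r_AI = r_BI = 1/2 are frozen; r_AII = \epsilon/8 and r_BII = \epsilon/16 still grow.
\begin{align*}
\text{Node A:}  \quad & \tfrac{1}{2} + \tfrac{\epsilon}{8} = 1
                 \;\Rightarrow\; \epsilon_A = 4, \\
\text{Node B:}  \quad & \tfrac{1}{2} + \tfrac{\epsilon}{16} = 1
                 \;\Rightarrow\; \epsilon_B = 8, \\
\text{Node II:} \quad & \tfrac{\epsilon}{8} + \tfrac{\epsilon}{16} = 1
                 \;\Rightarrow\; \epsilon_{II} = \tfrac{16}{3}, \\
b_2 &= \min\left(4,\, 8,\, \tfrac{16}{3}\right) = 4
      \;\Rightarrow\; r_{AII} = \tfrac{4}{8} = \tfrac{1}{2},
      \quad r_{BII} = \tfrac{4}{16} = \tfrac{1}{4}.
\end{align*}

% Step 2: r_AII = 1/2 is now frozen as well; only r_BII = \epsilon/16 grows.
\begin{align*}
\text{Node B:}  \quad & \tfrac{1}{2} + \tfrac{\epsilon}{16} = 1
                 \;\Rightarrow\; \epsilon_B = 8, \\
\text{Node II:} \quad & \tfrac{1}{2} + \tfrac{\epsilon}{16} = 1
                 \;\Rightarrow\; \epsilon_{II} = 8, \\
b_3 &= 8 \;\Rightarrow\; r_{BII} = \tfrac{8}{16} = \tfrac{1}{2}.
\end{align*}
```

Node II, considered in isolation, has the largest b value (16/3), yet it saturates together with node B at $\epsilon = 8$, which illustrates why the later breakpoints cannot be read off from the per-node b values.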

An earlier example demonstrated that a GPS fair rate 
allocation does not maximize the throughput of a switch. 
Extended GPS (EGPS) fair rate allocation can, in many 
instances, increase the throughput of a switch, but EGPS need 
not necessarily maximize the throughput of a switch. 
Consider Fig. 13. The GPS fair allocation and the EGPS 
fair allocation of rates assign each flow a rate of 1/2 for a 
total switch throughput of 1.5. A close examination of the 
situation reveals, however, that assigning flows AI and BII 
rates of 1 and flow BI a rate of 0 gives a throughput of 2. 

If all of the connections entering a single node GPS 
server are leaky bucket constrained ($\sigma$, $\rho$), a connection is 
guaranteed its average throughput if $\sum_i \rho_i < C$. This does not 
depend on the size of the connection's $\phi$ value. In heavily 
congested periods of time, real-time connections (whose 
guaranteed rates exceed their average rates in most cases) 
tend to dominate and so take more than their long-term share 
of the bandwidth. This is to ensure that their delay QoS 
constraints are met. Non-real-time connections receive rates 
less than their long-term average rates since these cells can 
wait longer without violating their contracts. Because the 
real-time traffic is leaky bucket constrained, the period of 
time that real-time connections can be backlogged is bounded. 
When the real-time flows become idle, all of the bandwidth is 
split amongst the non-real-time traffic. At this point the 
non-real-time traffic flows receive large amounts of 
bandwidth. 

In two stage EGPS, the foregoing is not the case. Even 
if every node is under-loaded, it is still possible for a 
connection not to receive an average throughput equal to its 
long-term average rate. The following example shows a case 
where this may occur. Fig. 14 shows the long-term average 
rates ($\rho$), maximum bucket sizes ($\sigma$), and weights ($\phi$) for 
three flows passing through two nodes. At each node, the sum 
of the long-term rates is less than the capacity of the nodes 
(assumed to be 1), since $\delta$ is small. For this example, fluid 
flow approximations are used for the queues and the servers, 
and it is assumed that the incoming links have infinite 
bandwidth. This approximation simplifies the math greatly 
without affecting the general problem. 

Assume at $t = 0^-$ that both servers are idle. At t=0, a 
maximum burst of 1 cell arrives from connection B and cells 
from connection B begin arriving at a rate of 1/2. Also at 
t=0, cells from A begin arriving at a rate of $1/4 + \delta$. While 
both A and B are backlogged at the first node, connection A 
receives service at a rate of 1/4 while connection B receives 
service at a rate of 3/4. Since cells from connection B are 
arriving at a rate of 1/2 but are served at a rate of 3/4, 
the backlog of cells waiting to be served is reduced at a 
rate of 1/4. Since this backlog began at t=0 at one cell, 
the queue for connection B empties after 1/(1/4) = 4 seconds. 
During this time, cells from connection A arrive at a rate of 
$1/4 + \delta$ and are served at a rate of 1/4. Thus, for 4 seconds 
the queue of A grows at a rate of $\delta$. Starting from 0, this 
queue reaches a depth of $4\delta$ at t=4. 

At t=4, connection A should begin receiving a larger 
amount of bandwidth from the first server because connection 
B is no longer backlogged at this time. However, at t=4, 
connection C, silent until this time, sends a cell and begins 
transmitting at a rate of 1/2. While C is backlogged, from 
t=4 to t=8 (note that B and C have identical parameters) 
connection A receives service at a rate of 1/4. In a normal 
multi-server network, the first node can transmit connection 
A cells at a rate higher than the second node can serve and 
the extra cells can be buffered at the second node. In the 
architecture of concern here, this is not the case. Since 
the bandwidth a connection receives is the minimum of the 
bandwidths available at each node, connection A is served at 
a rate of 1/4 in both nodes and the extra cells are queued at 
the first node. Thus, from t=4 to t=8, the queue of 
connection A cells at the first node grows from $4\delta$ to $8\delta$. 

If connection B stopped sending cells at t=4, it could 
begin to replenish its bucket at a rate of 1/2. In 4 seconds 
it could save up 4(1/2) = 2 tokens. Since the bucket size is 1 
cell, however, only one token can be kept. Assuming 
connection B did stop sending cells at t=4, by t=8 connection 
B can burst 1 cell. At t=8, it does send an entire cell and 
begin to transmit at a rate of 1/2. As at t=0, connection A 
is limited to a rate of 1/4 for 4 seconds and its queue grows 
by $4\delta$. At t=8, connection C becomes silent so, as with 
connection B before, it saves up enough credit to burst a 
cell at t=12. By having connections B and C alternately 
bursting and then being quiet, connection A can be held to a 
rate of 1/4 indefinitely, below the average arrival rate of 
$1/4 + \delta$. Since the queue is growing at a rate of $\delta$, this means 
that the queue for A can grow without bound. 
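
The arithmetic of this example can be reproduced with a short fluid-flow calculation. The Python sketch below is illustrative only: the helper names are hypothetical, and the value chosen for the small excess arrival rate is an arbitrary assumption, since the text only requires it to be small.

```python
from fractions import Fraction


def drain_time(backlog, arrival, service):
    """Seconds for a fluid queue to empty when service exceeds arrival."""
    return Fraction(backlog) / (Fraction(service) - Fraction(arrival))


def tokens_after(idle_time, fill_rate, bucket_size):
    """Leaky bucket credit saved while idle, capped at the bucket size."""
    return min(Fraction(bucket_size), Fraction(fill_rate) * idle_time)


def queue_depths_of_a(excess, periods, period=4):
    """Depth of A's queue at the end of each period; it grows at rate `excess`."""
    return [excess * period * k for k in range(1, periods + 1)]


excess = Fraction(1, 100)  # assumed value of the small excess arrival rate

# B: one-cell burst, arrivals at 1/2, service at 3/4 while A is also backlogged.
b_busy = drain_time(backlog=1, arrival=Fraction(1, 2), service=Fraction(3, 4))
# B then idles for 4 seconds with a bucket of one cell refilled at rate 1/2.
b_credit = tokens_after(idle_time=4, fill_rate=Fraction(1, 2), bucket_size=1)
# A is held to 1/4 in every 4-second period, so its queue never drains.
depths = queue_depths_of_a(excess, periods=3)
print(b_busy, b_credit, depths)
```

Each 4-second period adds four times the excess rate to the queue of connection A, so the depth grows linearly with time for any positive excess.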

A connection will receive its long-term average rate if 
the sum of the average rates of all of the connections 
passing through both the input and output nodes that the 
connection passes through is less than the capacity of the 
links. In the example, the sum of $\rho_A$, $\rho_B$, and $\rho_C$ is $1 + \delta > 1$. 

The example of Fig. 14 demonstrates that, unlike the 
single node GPS case, the average rates of connections must 
be considered when assigning $\phi$ values. Typically, $\phi$ values 
are assigned to guarantee connections a minimum rate 
during congested periods. Starting with the maximum burst 
size of a connection, $\sigma$, and the maximum delay parameter, $\tau$, 
a minimum $\phi$ value can be found. Since $\sigma$ cells need to be 
served in $\tau$ seconds, the minimum rate of a connection is $\sigma/\tau$. 
Normalizing this rate by the link rate, $(\sigma/\tau)/C$, gives the 
minimum fraction of the link bandwidth that the flow must be 
guaranteed. Assigning this number to $\phi$ ensures that the 
real-time constraints will be met, as long as the 
$\phi$'s at the node sum to one or less. With EGPS and a multi- 
stage architecture, $\phi$'s need to be assigned to ensure a 
minimum bandwidth at all times. Thus, the weight for a 
connection must be at least $\rho/C$. Because of this, the bound 
on the sum of the weights will be reached much sooner than in 
the single node case. This will result in lower utilization. 
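
The effect on the weight budget can be illustrated numerically. The Python sketch below assumes that, with EGPS, a connection's weight must cover both its delay-driven rate (burst size divided by delay bound) and its average rate, each normalized by the link capacity. The function names and the two example connections are hypothetical.

```python
from fractions import Fraction


def min_weight_single_node(sigma, tau, capacity):
    """Delay-driven weight: (burst / delay bound) / capacity."""
    return (Fraction(sigma) / Fraction(tau)) / Fraction(capacity)


def min_weight_egps(sigma, tau, rho, capacity):
    """Weight that also guarantees the average rate rho at all times."""
    delay_weight = min_weight_single_node(sigma, tau, capacity)
    average_weight = Fraction(rho) / Fraction(capacity)
    return max(delay_weight, average_weight)


capacity = 100  # cells per second, illustrative
# (burst, delay bound, average rate): a real-time and a non-real-time flow.
connections = {
    "real_time": (10, Fraction(1, 2), 5),
    "non_real_time": (10, 10, 30),
}

single = {k: min_weight_single_node(s, t, capacity) for k, (s, t, r) in connections.items()}
egps = {k: min_weight_egps(s, t, r, capacity) for k, (s, t, r) in connections.items()}
print(sum(single.values()), sum(egps.values()))
```

With these assumed numbers the two connections consume 21/100 of the weight budget in the single node case but 1/2 of it under EGPS, because the non-real-time flow must now be given a weight that covers its average rate.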

One consequence of this is that real-time and non-real- 
time traffic must be carefully balanced to maximize 
utilization. By coupling real-time traffic, which requires a 
large amount of bandwidth for delay purposes but cannot 
consistently utilize this bandwidth because of a low average 
rate, with non-real-time traffic, which has a large average 
rate but a high delay tolerance, a given amount of bandwidth 
between two ports may be kept consistently full while 
ensuring that delay bounds will be met. By sharing the 
bandwidth between these ports and giving the real-time 
traffic strict priority over the non-real-time traffic, the 
real-time delay requirements may be met and the bandwidth may 
be utilized more efficiently. 

More generally, consider the necessary bandwidth ("BW") 
between two ports due to real-time ("rt") connections between 
the ports 

$$BW^{rt} = \sum_{\text{rt flows}} \frac{\sigma_j}{\tau_j} \qquad (18)$$

and the bandwidth necessary due to the average rates of all 
of the connections 

$$BW^{\rho} = \sum_{\text{all flows}} \rho_j. \qquad (19)$$

The weight necessary to support these connections is 

$$\phi = \frac{\max\left(BW^{rt},\, BW^{\rho}\right)}{C}. \qquad (20)$$

The utilization of the bandwidth can be 100% if $BW^{\rho} \ge BW^{rt}$. 
Thus, high utilization depends on having enough non-real-time 
traffic. 

In single node GPS, the available server bandwidth is 
divided among backlogged connections. Thus, if a connection 
goes from idle to backlogged, the previously backlogged 
connections have their rates reduced. This is not always the 
case in multi-stage (e.g., two stage) GPS. 

Consider the switch shown in Fig. 15, where the node 
capacities are assumed to be one (1). Assume initially that 
flow AI is idle. Node B is the bottleneck node and the two 
backlogged flows receive the rates $r_{BI} = 8/9$ and $r_{BII} = 1/9$. Now 
assume flow AI becomes backlogged. Node I is the bottleneck 
node so $r_{AI}$ and $r_{BI}$ are each limited to 1/2. Flow BII, being 
the only non-frozen flow, expands to fill the available 
bandwidth and is assigned a rate of 1/2. Thus, when AI 
becomes backlogged, the rate assigned to BII goes from 1/9 to 
1/2. This occurs because the newly active flow, AI, does not 
share a node with BII. It is possible for yet another 
connection that does not pass through either node B or II to 
become backlogged and cause the rate of BII to be reduced. 
In particular, if another flow through A becomes active and 
limits the rate of AI to less than 1/2, then $r_{BI}$ becomes 
greater than 1/2 and $r_{BII}$ ends up less than 1/2. 

The following illustrates how multi-stage architectures 
running EGPS act under certain connection patterns. For 
instance, in Fig. 16, all of the flows from the input nodes 
pass through a single output node. In this case, the output 
node will be the bottleneck node and the input nodes will 
accept whatever bandwidth the output node will allow them. 
The output node will distribute the rates in a GPS fair 
manner. Thus, the output node will act like a single node 
GPS server. This example shows that EGPS reproduces the GPS 
process exactly under the right circumstances. 

In Fig. 17, all of the flows pass through a single input 
node. If the incoming link has a bandwidth equal to the 
bandwidth available leaving the node, there should not be 
many scheduling decisions because cells are sent out as they 
come in. Since none of the output nodes have traffic from 
other input nodes, they can handle all of the traffic the 
input node can send them. If, however, the incoming link has 
a peak rate higher than the service rate of the input node, 
then cells may come in faster than they can be served and 
multiple flows may become backlogged. The average rate of 
arrival is bounded by the input node service rate. 

At this point the input node must decide how to serve 
the backlogged connections. Since the output nodes can 
devote 100% of their bandwidths to the flows, they are not 
the bottlenecks. The input node, which is the bottleneck 
node, distributes bandwidth based on the weights of the 
backlogged connections. Thus, the input node is acting like 
a single node GPS server. As before, EGPS becomes GPS. 

6.2. Rate Assignments for an Interval 

One of the assumptions underlying GPS is that the rates
at which the flows are being served are re-determined
instantaneously when any connection goes from idle to
backlogged or backlogged to idle. This may be impractical,
especially in a multi-stage architecture using a process such
as EGPS where nodes are not independent. This interdependence
means that nodes need to communicate with each other, which
makes instantaneous recalculations difficult.

One alternative to recalculating rates instantaneously 
is to calculate them on a fixed interval. At predetermined 
time intervals, the necessary data about each connection is 
examined and the rates are adjusted. 

One issue is how often the rates must be re-determined.
Consider what happens as the recalculation interval grows
from very short, where the expected performance of the
process is close to the performance of the process with
instantaneous rate recalculations, to very long. A single
node server will be considered to simplify the analysis.

6.2.1 Very Short Intervals 

First consider the interval to be very short. How short
is very short is not set, but it could be as short as one or
two cell slots. At the longest, a very short time interval
is a small fraction of the delay tolerance of the real-time
traffic. For very short intervals, the GPS process may be
used without modification. This means that each connection
has only two states, backlogged and idle, and the weights
are fixed. Bandwidth is assigned to the backlogged
connections in proportion to these weights and each
backlogged connection is given as much bandwidth as possible.

For instance, consider Fig. 18. Assume that both
connections have the same priority, so they have identical
$\phi$ values. Connection one (1) has one cell queued and
connection two (2) has many cells queued. In standard GPS
with instantaneous rate recalculation, both connections would
get 50% of the server's bandwidth. Connection 1 would only
need it for a short while, until its single cell was served,
and then the connection would be idle. At this point the
rates would be re-determined. Since flow 2 is the only
backlogged connection, flow 2 would receive 100% of the
server's bandwidth and the cells in flow 2 would then be
served at a high rate.

Now assume the rates are re-determined every 4 slots.
In this case, each connection would receive two sending
opportunities. Flow 2 would use both of its opportunities
but flow 1 would only use one opportunity. Thus, one slot
would go unused. After four slots, the rates would be
re-determined. Since connection 2 is the only backlogged flow,
it would receive all of the slots for the next interval.

6.2.2 Short Intervals

Consider the example from above, assuming that the
interval between rate recalculations is longer but still
relatively short. For the purposes of this example,
relatively short is 20 slots or thereabouts. If each of the
connections above is given 50% of the slots, then connection
1 will waste nine slots. The overall efficiency of the
server is rather poor. One way around this problem is to
assign bandwidth not simply based on a static weight and a
binary state (idle/backlogged), but rather to incorporate the
queue depths into the rate assignments. For instance, in the
above example, connection 1 would be given a rate such that
one or two cells would be served over the interval. The
remaining slots would be given to connection 2. Thus, using
the current system state information, a higher utilization
can be achieved.
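
The slot arithmetic of this example can be checked with a
short sketch. The queue depth of 100 cells for connection 2
stands in for "many cells", and the depth-aware rule shown
(equal turns, skipping connections whose queues are exhausted)
is only one hypothetical way to incorporate queue depths.

```python
def wasted_slots(slots_awarded, queue_depths):
    """Slots awarded to connections that have no cell to fill them."""
    return sum(max(0, s - q) for s, q in zip(slots_awarded, queue_depths))

def depth_aware_split(interval, queue_depths):
    """Hand out slots in equal turns, but only to connections that
    still have queued cells (a hypothetical depth-aware rule)."""
    awarded = [0] * len(queue_depths)
    remaining = list(queue_depths)
    free = interval
    while free > 0 and any(remaining):
        for i in range(len(remaining)):
            if remaining[i] > 0 and free > 0:
                awarded[i] += 1
                remaining[i] -= 1
                free -= 1
    return awarded

queues = [1, 100]          # connection 1: one cell; connection 2: many
static = [10, 10]          # 50% each of a 20-slot interval
print(wasted_slots(static, queues))                         # 9
print(depth_aware_split(20, queues))                        # [1, 19]
print(wasted_slots(depth_aware_split(20, queues), queues))  # 0
```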

6.2.3 Long Intervals 

Now assume the interval is even longer. What if
connection 1 is assigned a small rate and then a large number
of connection 1 cells arrive? These cells would be forced to
wait until the end of the interval. Only then would the rate
assigned to connection 1 be increased. The last cells to
arrive would still have to wait until the earlier cells were
served. Thus, cells could easily have to wait almost 2
intervals before they are served. With a short interval,
this may not be a problem. If the interval is long, e.g., a
large fraction of the delay bound for a real-time
connection, this may not be acceptable. In these cases, it
may be necessary to predict what the upcoming traffic will
look like or plan for the worst case arrival pattern. For a
CBR connection with a rate of r, the connection is given a
rate of r, even if there are currently no cells from the
connection queued. Likewise, a real-time VBR connection may
need to be given enough bandwidth to handle a maximum burst
in order to ensure that if a burst does occur, the node can
handle it.

One way is to assign bandwidth to connections based on a
minimum bandwidth needed to meet real-time delay bounds and a
weight to determine how the leftover bandwidth should be
divided amongst flows. The minimum rate, $r_i^{min}$, should
be the bandwidth needed by the real-time connections. The
weight, $\phi_i$, could be based on queue depth, average rate
of the non-real-time traffic, and other parameters. The
bandwidth a connection receives is given by

$$r_i = r_i^{min} + b\,\phi_i \qquad (21)$$

where b is the expansion factor common to all of the
connections. It is set so the sum of all of the connections'
rates is equal to the bandwidth of the server. Thus,

$$C = \sum_j r_j = \sum_j r_j^{min} + b \sum_j \phi_j \qquad (22)$$

which, rearranged, gives

$$b = \frac{C - \sum_j r_j^{min}}{\sum_j \phi_j}. \qquad (23)$$

Using the process of section 6.1, a CBR connection can
be given an $r_i^{min}$ value equal to its average rate
(assuming it does not need any extra bandwidth for delay
purposes) and a $\phi_i$ value of zero. This would ensure that
the connection received all the bandwidth it needed, but no
more (which would be wasted). A non-real-time connection, in
contrast, would receive $r_i^{min} = 0$ but a large $\phi_i$
value. This connection would surrender bandwidth to real-time
connections if necessary, but it will take a large portion of
any available extra bandwidth.
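
As a worked instance of eqs. (21) and (23), the sketch below
computes the common expansion factor and the resulting rates
for one CBR connection and two non-real-time connections. The
numeric values and the function name are illustrative
assumptions, not taken from the source.

```python
from fractions import Fraction

def interval_rates(capacity, min_rates, weights):
    """Return (b, rates) with b from eq. (23) and r_i from eq. (21)."""
    total_weight = sum(weights)
    b = (capacity - sum(min_rates)) / total_weight
    return b, [r_min + b * w for r_min, w in zip(min_rates, weights)]

C = Fraction(1)
# Connection 0: CBR, minimum rate 3/10 of C, weight zero.
# Connections 1 and 2: non-real-time, no minimum, weights 3 and 1.
b, rates = interval_rates(C, [Fraction(3, 10), 0, 0], [0, 3, 1])
print(b, rates)   # 7/40 and [3/10, 21/40, 7/40]
```

The CBR connection receives exactly its minimum rate, and the
leftover 7/10 of the server is split 3:1 between the other two.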

A number of other variations may be used. For instance,
if a connection does not require bandwidth beyond a certain
amount, then an $r_i^{max}$ value may be set. Additional
bandwidth would not be awarded to a connection past
$r_i^{max}$. If $\sum_i r_i^{max} < C$, then a second set of
weights can distribute the extra bandwidth for UBR traffic.

6.2.4 Very Long Intervals

As the interval between rate recalculations becomes very
long, it becomes increasingly difficult to make predictions
about the state of the switch during the interval. While the
queue may be quite full at the beginning of the interval, it
may empty and fill several times before the rates are
re-determined. It is difficult to justify giving a flow a low
rate based on the fact that its queue is currently empty when
it may burst many times before its rate is increased.
In fact, if the flow has not burst recently, then it may
have stored up some burst credits and may be more likely to
burst in the future. In this set-up, rates are assigned
based on permanent parameters, such as maximum burst size and
long-term average rate.

6.3 Assigning Peak Rates

The rates assigned to flows can be viewed as peak cell
rates (PCR) and the sum of these rates can exceed the
capacity of an input node. In this process, a rate does not
guarantee a flow a certain amount of service, but instead is
an upper bound on the amount of service a flow could receive.
If the sum of the rates of the flows passing through an
output node is less than or equal to the capacity of the
node, then the output buffers will not overflow. If the sum
of the rates of the flows passing through an input node is
greater than the capacity of the node, data loss is unlikely,
since these rates do not control how fast traffic is
arriving. The input node cannot serve the flows at these
rates, but it has the freedom to serve the flows at any
combination of rates such that the sum of the actual service
rates is equal to (or even less than) the capacity of the
node and each of the individual flow rates is less than or
equal to its assigned peak rate.

There are situations where assigning peak rates is quite
useful from the standpoint of both increasing throughput and
decreasing the time it takes cells to pass through the
system. These situations involve assigning rates to flows
for an interval of time when the interval of time is long
enough that it is necessary to predict which flows might have
cell arrivals. The following examples assume that a rate
must be assigned to each flow for an interval of time. All
of the nodes are assumed to have a capacity of C.

Example 1: Fig. 19 shows a two-stage switch with two
connections. Both connections pass through the same input
node (A) but different output nodes (I and II). Both
connections have the same weight (0.4), so they are equally
important by some measure. For this example, assume that both
connections X and Y have a large number of cells queued at A
waiting to be served (enough so that neither connection will
run out of cells during an interval, even if one of the
connections is allowed to send cells at a rate of C for the
entire interval). Since the connections have the same
weights, each flow is assigned a rate of C/2.

Example 2: Predicting cell arrivals. Using the same
switch set-up as in example 1, assume that neither connection
has any cells queued. It is likely that some cells will
arrive on one or both of the connections, however, so
assigning rates of zero to each of the flows is not a good
idea. At first, the most reasonable set of rates appears to
be $r_X = r_Y = C/2$, since the two connections have the same
weights. A further examination reveals another possibility,
however. First, note that the weights of the connections, and
even the long-term average rates of the connections,
$\rho_X$ and $\rho_Y$, do not give much information about
what the traffic mix will be in the short term. It is
possible that the incoming traffic mix will be 50% X and 50%
Y, but it is also possible that the mix will be 75% X, 15% Y,
and 10% unused or even 0% X and 100% Y. Second, note that
nodes I and II serve only connections X and Y respectively.
Thus, as far as these output nodes are concerned, there is no
downside to giving either flow a rate of C. Since I and II
are not receiving traffic from any node but A, there is no
chance of overwhelming either of them. Combining these two
facts leads to the conclusion that the best scheduling rule
for node A is to not schedule cells at all, just send them on
as they come in. Since each output can handle cells at a rate
C, and cells cannot arrive at the input faster than C, it is
not possible to overload the output nodes.

Equivalent to this scheduling rule is assigning PCR
rates of $r_X = r_Y = C$ to the flows. Even if 100% of the
incoming traffic was destined for one of the outputs, it
could be sent on at a rate of C, which translates into never
queuing a cell. This means better throughput and lower delay
for cells. If traffic does arrive in a pattern of 50% X and
50% Y, the traffic would still be handled first come first
served with no cells being queued. Note that even though
$r_X + r_Y = 2C > C$ at the input node, there will not be any
problems. Traffic can still enter node A at a rate of at most
C. In contrast, consider what would happen if
$r_X = r_Y = C/2$ and 100% of the arriving traffic belongs to
connection X. Since connection X is only entitled to every
other departure opportunity, every other cell departure
opportunity would be unused while connection X cells are
waiting to be served. Both node A and node I are greatly
underutilized.
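
The contrast drawn in Example 2 can be illustrated with a
slot-level sketch. The credit counter used here to enforce a
peak rate is an assumed mechanism chosen for illustration; the
source does not specify how a node polices its assigned rate.

```python
from fractions import Fraction

def serve_flow(arrivals, pcr, slots):
    """Return (cells sent, cells left queued) for one flow at node A.

    arrivals[t] is the number of cells arriving in slot t; pcr is the
    flow's peak rate expressed as a fraction of the capacity C.
    """
    pcr = Fraction(pcr)
    queue, sent, credit = 0, 0, Fraction(0)
    for t in range(slots):
        queue += arrivals[t]
        credit = min(credit + pcr, Fraction(1))  # hold at most one cell of credit
        if queue and credit >= 1:
            queue -= 1
            sent += 1
            credit -= 1
    return sent, queue

slots = 100
all_x = [1] * slots                 # 100% of arrivals belong to X
print(serve_flow(all_x, 1, slots))               # (100, 0)
print(serve_flow(all_x, Fraction(1, 2), slots))  # (50, 50)
```

With a PCR of C no cell is ever queued; with a PCR of C/2 half
of the departure opportunities go unused while cells wait.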

Example 3: Example 1 revisited. Now reconsider the
situation in example 1 with a PCR rate assignment approach.
If nodes I and II were to assign flows X and Y PCR rates,
they would set $r_X = r_Y = C$. Node A cannot send to each of
the output nodes at a rate of C, since this would mean it was
serving cells at a rate of 2C. Node A can, however, make its
own scheduling decisions and serve both X and Y at rates that
seem fair to node A as long as the rates node A assigns to
the connections are less than or equal to the PCR rates that
the output nodes have assigned to the flows (in this case,
this can be done since the PCRs are equal to the capacity of
the input node). It seems reasonable for the input node to
divide up its available bandwidth equally between the two
flows since $\phi_X = \phi_Y$. If the weights were not equal,
it would be reasonable for node A to divide up the bandwidth
in proportion to the weights. In fact, it is even possible
for the input node to change the bandwidth assignments in the
middle of an interval if one of the queues becomes empty.
Because the input node is acting independently of the output
nodes and the other input nodes, it does not need to
communicate with them when deciding on service rates. Thus
the communications delay problem, the problem that led to
interval rate assignments in the first place, has been
removed to some degree.
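
One way an input node might pick its own service rates under
the assigned PCRs is a weighted fair split that stops growing
a flow once it reaches its PCR. The sketch below is an
assumption about how such a local rule could look; the
function and parameter names are hypothetical.

```python
from fractions import Fraction

def local_service_rates(capacity, weights, pcr, backlogged):
    """Split an input node's capacity among backlogged flows in
    proportion to their weights, never exceeding a flow's PCR."""
    rates = {f: Fraction(0) for f in weights}
    alive = {f for f in backlogged if weights[f] > 0 and pcr[f] > 0}
    left = Fraction(capacity)
    while alive and left > 0:
        total_w = sum(weights[f] for f in alive)
        # Expand until capacity runs out or the nearest PCR cap is hit.
        step = min([left / total_w] +
                   [(pcr[f] - rates[f]) / weights[f] for f in alive])
        for f in alive:
            rates[f] += step * weights[f]
        left -= step * total_w
        alive = {f for f in alive if rates[f] < pcr[f]}
    return rates

w = {"X": Fraction(2, 5), "Y": Fraction(2, 5)}   # weights of 0.4
caps = {"X": 1, "Y": 1}                           # PCRs of C, with C = 1
print(local_service_rates(1, w, caps, {"X", "Y"}))  # X and Y each get 1/2
print(local_service_rates(1, w, caps, {"Y"}))       # Y gets 1 when X is empty
```

Re-running the split when a queue empties corresponds to the
mid-interval adjustment mentioned in Example 3, and needs no
communication with the output nodes.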

As long as the input nodes serve connections at rates
no greater than the assigned PCRs, they can change their
rates whenever it is convenient to do so and to whatever they
think is fair. It is this freedom that allows the input nodes
to maximize their throughput. The PCR rates still need to be
fixed for an interval of time, since setting them still
requires communication (state variables from the inputs to
the outputs and rates from the outputs to the inputs).

Example 4: Fig. 20 shows a 2x2 switch with three
connections. There are several ways of assigning rates based
on what is to be maximized and what is the predicted input
pattern of the traffic. For instance, if throughput is to be
maximized and the queues for flows X and Z are fairly deep
(and/or we expect a lot of traffic to arrive for these
connections), the following assignments could be made:
$r_X = r_Z = C$ and $r_Y = 0$. A fairer solution would be to
assign $r_X = r_Y = r_Z = C/2$, but this would come at the
expense of some throughput.

Examining the problem from a PCR point of view gives
some insights. Initially, there is no reason not to assign
$r_X = C$, since node A would be free to serve flow X at a
slower rate if flow Y has traffic. Then, the question is what
PCRs to assign to flows Y and Z. To avoid the possibility of
overflowing node II, we must have $r_Z + r_Y \le C$. At this
point, the choice depends on probabilities. For instance, if
connection Z is a very regular CBR stream arriving dependably
with a rate of 0.1C, while connection Y is a VBR stream that
fairly regularly bursts at rates near C, the rate assignment
process gives Z a PCR of 0.1C, since Z will use all of this
bandwidth but never any additional bandwidth, and gives Y a
PCR of 0.9C, since Y might use it. Even though node A already
has a sum of rates greater than its capacity ($C + r_Y$),
flow Y is still more likely to use the bandwidth than flow Z.
If, on the other hand, Y needed 0.4C consistently and Z were
bursty, it makes sense to assign $r_Y = 0.4C$ and
$r_Z = 0.6C$.

If both Y and Z are mildly bursty, then $r_Y = 0.8C$ and
$r_Z = 0.2C$ could be assigned. If it turns out that Z has a
large burst and Y does not, then the switch throughput might
have been greater with $r_Y = 0.4C$ and $r_Z = 0.6C$. It also
might happen that the optimal rates are $r_Y = 0.9C$ and
$r_Z = 0.1C$.

In general, if we are passing out bandwidth at output
II, and we have $r_Y = y$ and $r_Z = z$, where
$y + z = C - \epsilon$ so far, we must decide which connection
gets the final $\epsilon$ of bandwidth. A question to ask is
which connection is more likely to use this bit of bandwidth
without regard to the sum of the rates in the various input
nodes. Whichever flow has a higher probability of using this
bandwidth should receive it.

Thus, the above examples show the advantages of
assigning PCR rates to connections instead of normal rates.
This does mean that the input nodes must run some type of
rate assignment process to determine exactly what rate
various backlogged flows receive, but this freedom allows the
input nodes to optimize themselves to a certain extent. This
has potentially significant advantages for certain connection
configurations.

6.4 Rate Assignment: Process A

This process is designed to calculate the rates assigned
to each connection in a switched system. This process uses
an estimate of the number of cells each connection is
expected to have arrive during the upcoming frame to predict
the amount of bandwidth desired by each connection.
Bandwidth is then distributed in a globally optimal way
similar to the EGPS process to ensure max/min weighted
fairness between connections.

The following are the general rules used for process A:

• The process is interval based. Rates assigned to
connections are valid for the duration of the
interval.

• The intervals are synchronous. All of the rates are
updated simultaneously and are valid for the same
period of time.

• Bandwidth is distributed in a weighted fair manner.
If two connections are being assigned rates, the
ratio of the rates is kept equal to the ratio of the
weights as long as possible.

• The bandwidth a connection receives is capped by the
expected amount of bandwidth it would use over the
course of the interval. The number of cells queued
for a connection added to the expected number of
cells to arrive during the interval dictates the
maximum rate a connection should be given.

• At least initially, neither input nodes nor output
nodes should be oversubscribed.

• The process is run in a central location.

• After assigning rates in the above manner, any excess
bandwidth in the output nodes is distributed to the
connections passing through them. This extra
bandwidth is distributed without worrying about
overloading input nodes or exceeding the expected
bandwidth desires of the connections.

The actual process is an interval-based version of EGPS
(section 6.1.2.2). Before assigning rates, an estimate of
how many cells each connection will have is made. Dividing
these numbers by the interval length gives the maximum rates
that will be assigned to each of the connections. Once a
connection reaches this bandwidth, it is not assigned any
additional bandwidth until the overbooking phase. A
connection can also be frozen if an input or output node it
passes through saturates. After all of the connections have
been frozen, the output nodes are examined. If an output
node has not saturated, then the excess bandwidth is
distributed between the connections that pass through it.
This occurs even if the new rates exceed the maximum rates of
the connections or if the sum of the rates at an input node
exceeds the capacity of the input. The additional bandwidth
is only used if the input node does not have enough other
traffic to send.

6.4.1 Maximum Bandwidth For A Connection

One of the features of process A is that it estimates
the bandwidth that each connection would use if it could.
Let $q_k$ be the number of cells in the queue of connection k
at the beginning of the time interval. Let $a_k$ be the
expected number of arrivals over the next interval. Thus, if
connection k had the opportunity to send as many cells as it
could over the next interval, it would expect to send
$q_k + a_k$ cells, unless this number was greater than the
number of slots in the interval. If the interval was T units
long (seconds, cell slots, whatever is most convenient) then
the maximum bandwidth process A will give a connection is

$$BW_k^{max} = \frac{q_k + a_k}{T}. \qquad (24)$$

Bandwidth in excess of this value may be assigned to this
connection if all of the connections are frozen.

6.4.2 Expansion Factors 



Recall that in single node GPS (section 6.1.1), each
connection is assigned a weight $\phi_i$. The GPS notion of
fairness is a weighted fair bandwidth distribution between
the backlogged connections. If two connections, i and j, are
backlogged and they are assigned rates $r_i$ and $r_j$
respectively, then the rates are fair if the rates are in
proportion to the ratio of their weights. That is,

$$\frac{r_i}{r_j} = \frac{\phi_i}{\phi_j}. \qquad (25)$$

This must be true for any pair of backlogged connections.
Any number of rate sets will satisfy this constraint. One
set of rates is the set that sums up to the server's
capacity.

So, letting B be the set of connections that are
backlogged (and therefore will receive non-zero rates), the
rate of a connection in B is

$$r_i = \frac{\phi_i\,C}{\sum_{j \in B} \phi_j} = \phi_i\,\frac{C}{\sum_{j \in B} \phi_j} = b\,\phi_i \qquad (26)$$

where $b = C / \sum_{j \in B} \phi_j$ is the expansion factor.
Any (non-zero) rate can be determined as $r_k = b\,\phi_k$.
Notice that this satisfies the GPS notion of fairness since

$$\frac{r_i}{r_j} = \frac{b\,\phi_i}{b\,\phi_j} = \frac{\phi_i}{\phi_j}. \qquad (27)$$

Note that the rates must be re-determined any time the set B
changes, which could be quite often.
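
A minimal sketch of the single node calculation follows. It
computes the expansion factor for the backlogged set and
derives each rate from it; the function name and the example
weights are illustrative assumptions.

```python
from fractions import Fraction

def gps_rates(capacity, weights, backlogged):
    """Single node GPS: r_k = b * phi_k for backlogged k, else zero,
    where b = C / (sum of the backlogged weights)."""
    if not backlogged:
        return {k: Fraction(0) for k in weights}
    b = Fraction(capacity) / sum(weights[k] for k in backlogged)
    return {k: (b * weights[k] if k in backlogged else Fraction(0))
            for k in weights}

weights = {1: 2, 2: 1, 3: 1}
print(gps_rates(1, weights, {1, 2}))      # connection 3 idle
print(gps_rates(1, weights, {1, 2, 3}))   # rates recomputed when B changes
```

Every pair of non-zero rates stays in the ratio of the
corresponding weights, and the rates always sum to the server
capacity.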

A useful way of thinking about what is happening is to
give each backlogged connection a rate
$r_i = \epsilon\,\phi_i$, where $\epsilon$ is an expansion
factor. By starting $\epsilon$ at zero and slowly increasing
it, the bandwidth assigned to each flow slowly expands. At
all times, the bandwidths are GPS fair because each one is
proportional to its weight $\phi_i$. $\epsilon$ can be
increased until the sum of the rates reaches the capacity of
the server. The $\epsilon$ for which $\sum_{i \in B} r_i = C$
is b.

The situation considered here is different from the
single node GPS server in two ways. First, this is a
two-level system, so nodes cannot act in isolation. Steps are
taken to ensure that the sum of the rates entering an output
node does not exceed the node's capacity. Second, rates are
assigned for an interval. This means that the rates assigned
to flows do not change when the set of backlogged connections
changes.

In this case, there is no single expansion factor for
the system. Instead, each connection is assigned its own
expansion factor. All of these factors start at zero and are
increased together until one of the nodes saturates or one of
the connections is maximized. At this point, the expansion
factor(s) of the affected connection(s) (the connections that
pass through the saturated node or the connection that is
maximized) is (are) frozen. The non-frozen expansion factors
are then increased until another node saturates or another
connection is maximized. Using expansion factors is
convenient because all of the non-frozen factors are equal
and are increased by the same amount in each iteration. The
process stops when all of the connections are frozen. At
this point, any connection y, with expansion factor $b_y$ and
weight $\phi_y$, is assigned a rate $r_y = b_y\,\phi_y$.

6.4.3 Process Steps

The following quantities are used in process A.

$L$: set of connections passing through the switch
$L^{frozen}$: connections that have been frozen
$L^{alive} = L \setminus L^{frozen}$: connections whose
bandwidth can be increased
(backslash, \, denotes the difference between two
sets. For two sets E and F,
$E \setminus F = \{x : x \in E \text{ and } x \notin F\}$.)
$N$: set of nodes, both input and output
$N^{frozen}$: nodes that have reached their capacity
$N^{alive} = N \setminus N^{frozen}$: nodes that have not
reached capacity
$C_j$: capacity of node j
$L_j$: set of connections that pass through node j

Note that L and N are fixed sets. $L^{frozen}$ and
$N^{frozen}$ start at $\emptyset$ (the null set) and grow
towards L and N. $L^{alive}$ and $N^{alive}$ start at L and N
and shrink towards $\emptyset$ with every iteration.

Process A proceeds as follows.

Step 0:

$$L^{alive} = L \qquad (28)$$
$$N^{alive} = N \qquad (29)$$

Calculate $BW_k^{max}$ (the maximum bandwidth) for every
connection. For every connection,
$b_k^{max} = BW_k^{max} / \phi_k$,
$b_k^{remaining} = b_k^{max}$, and $b_k^{c} = 0$.
$b_k^{max}$, a fixed quantity, is the amount connection k may
expand before it equals $BW_k^{max}$. $b_k^{remaining}$
shrinks toward zero with every iteration. $b_k^{c}$ is the
current expansion coefficient for each connection. The
expansion coefficients of all non-frozen connections grow at
the same rate. When connection k is frozen, $b_k^{c}$ is
frozen.

For every node

$$b_j^{n} = \frac{C_j}{\sum_{k \in L_j} \phi_k}. \qquad (30)$$

This is the amount each of the connections passing
through a node may expand before the node reaches its
capacity. This quantity changes with every iteration.

Step 1:

Find the minimum $b_j^{n}$ of all the nodes in $N^{alive}$.
Find the minimum $b_k^{remaining}$ of all the connections in
$L^{alive}$. Define $b^{min}$ as the lesser of these two
minima. This is how much the non-frozen connections can
expand before either a node saturates or a connection is
maximized.



Step 2:

For all connections in $L^{alive}$

$$b_k^{c} := b_k^{c} + b^{min} \qquad (31)$$

and

$$b_k^{remaining} := b_k^{remaining} - b^{min}. \qquad (32)$$

This updates the $b_k^{c}$ values for all of the non-frozen
connections.



Step 3:

For each node in $N^{alive}$ calculate

$$b_j^{n} = \frac{C_j - \sum_{k \in L_j} b_k^{c}\,\phi_k}{\sum_{k \in L_j \cap L^{alive}} \phi_k}. \qquad (33)$$

If node j bottlenecks on this iteration (the previous
value of $b_j^{n}$ equaled $b^{min}$) then the new $b_j^{n}$
(determined in this step) will be zero.

Step 4:

Define $N^{bn}$ = {nodes for which $b_j^{n} = 0$} (bn for
bottlenecked) and $L^{bn} = \bigcup_{j \in N^{bn}} L_j$.
$L^{bn}$ is the set of connections that pass through a
bottlenecked node.

Define $L^{max}$ = {connections for which
$b_k^{remaining} = 0$}. This is the set of connections that
have received all the bandwidth they expect to need.

Now

$$L^{frozen} = L^{bn} \cup L^{max} \qquad (34)$$

and

$$N^{frozen} = N^{bn} \cup \{\text{node } j \text{ where } L_j \subset L^{frozen}\}. \qquad (35)$$

So

$$L^{alive} = L \setminus L^{frozen} \qquad (36)$$

and

$$N^{alive} = N \setminus N^{frozen}. \qquad (37)$$

Most of this work is done because a node can be frozen
without being bottlenecked. This can occur when all of
the connections passing through a node are frozen for
some other reason.

This step constructs these sets each time. It may be
simpler to just update these sets each iteration.

Step 5:

If $L^{alive} \ne \emptyset$ (an empty set) then GO TO Step 1.

Step 6:

If an output node is not bottlenecked then it has
capacity that has no chance of being utilized. The
$BW_k^{max}$ value for each connection is based on an
estimate, however, so the actual traffic needing to be served
will most likely be different. If the excess capacity of an
output node is distributed, then the input nodes have
the opportunity to adjust their rates to maximize the
throughput of the system based on the actual traffic
arrival pattern. There are many possible ways of
passing out this extra bandwidth. One way is to award
it based on the weights of the connections. To do this,
calculate, for each output node,

$$b_j^{n,extra} = \frac{C_j - \sum_{k \in L_j} b_k^{c}\,\phi_k}{\sum_{k \in L_j} \phi_k}. \qquad (38)$$

(This will be zero for bottlenecked nodes.) Thus, the
new expansion factor for connection k, which passes
through output node j, is

$$b_k^{PCR} = b_k^{c} + b_j^{n,extra}. \qquad (39)$$

Step 7:

For any connection there are two rates. The first, $r_k$,
is the rate a connection should receive as long as it
has cells to send. The second, $r_k^{PCR}$, is the maximum
rate a connection should receive. These rates are

$$r_k = b_k^{c}\,\phi_k \qquad (40)$$

and

$$r_k^{PCR} = b_k^{PCR}\,\phi_k. \qquad (41)$$



Some special cases have been ignored in the above
process. For instance, if no connections pass through a node
then this node should be placed in $N^{frozen}$ at the start
of the process and eq. (26) should not be determined for this
node. In addition, if $BW_k^{max}$ is zero for some
connection, it can be placed in $L^{frozen}$ in step zero.
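
The steps above can be sketched compactly as follows. This is
an illustrative reading of process A, not the code of the
Appendix: it recomputes the per-node expansion room at the top
of each pass (folding Steps 0 and 3 together), tracks only the
alive connections rather than all four sets, and uses
hypothetical names for every function and parameter. The
example topology and weights are assumptions as well.

```python
from fractions import Fraction

def process_a(paths, weights, bw_max, capacity, outputs):
    """paths: connection -> nodes it crosses; weights: connection -> phi_k;
    bw_max: connection -> BW_k^max; capacity: node -> C_j;
    outputs: set of output nodes. Returns (rate, pcr) dictionaries."""
    conns_at = {j: [k for k in paths if j in paths[k]] for j in capacity}
    b_cur = {k: Fraction(0) for k in paths}                        # b_k^c
    b_rem = {k: Fraction(bw_max[k]) / weights[k] for k in paths}   # b_k^remaining
    alive = {k for k in paths if b_rem[k] > 0}

    def used(j):
        return sum(b_cur[k] * weights[k] for k in conns_at[j])

    while alive:
        # Room per unit weight at each node that still has alive connections.
        b_node = {}
        for j in capacity:
            live_w = sum(weights[k] for k in conns_at[j] if k in alive)
            if live_w:
                b_node[j] = (capacity[j] - used(j)) / live_w
        # Step 1: the largest step before a node saturates or a connection maxes.
        b_min = min(list(b_node.values()) + [b_rem[k] for k in alive])
        # Step 2: expand every alive connection by the same amount.
        for k in alive:
            b_cur[k] += b_min
            b_rem[k] -= b_min
        # Step 4: freeze maximized connections and those through bottlenecks.
        bottlenecked = {j for j in b_node if b_node[j] == b_min}
        alive = {k for k in alive
                 if b_rem[k] > 0 and not bottlenecked.intersection(paths[k])}
    # Step 6: overbook any spare capacity left at the output nodes.
    b_pcr = dict(b_cur)
    for j in outputs:
        total_w = sum(weights[k] for k in conns_at[j])
        if total_w:
            extra = (capacity[j] - used(j)) / total_w
            for k in conns_at[j]:
                b_pcr[k] = b_cur[k] + extra
    # Step 7: convert expansion factors to rates.
    rate = {k: b_cur[k] * weights[k] for k in paths}
    pcr = {k: b_pcr[k] * weights[k] for k in paths}
    return rate, pcr

caps = {"A": 1, "B": 1, "I": 1, "II": 1}
paths = {"AI": ("A", "I"), "BI": ("B", "I"), "BII": ("B", "II")}
rate, pcr = process_a(paths, {"AI": 8, "BI": 8, "BII": 1},
                      {"AI": 10, "BI": 10, "BII": 10}, caps, {"I", "II"})
print(rate, pcr)
```

In this example every connection ends with a rate of 1/2, and
the spare half of output node II raises the PCR of BII to 1
even though input node B could not sustain that rate.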



6.4.4 An Extension For Assigning Minimum Rates 
One extension to process A is to assign each connection
a minimum rate $r_k^{min}$, such as zero. As long as the
minimum rates at a node sum to less than the capacity of the
node, redefine the initial capacity of the node (used in
step 0) to be

$$C_j = C_j - \sum_{k \in L_j} r_k^{min}. \qquad (42)$$

The value of $b_k^{max}$ must be defined as
$(BW_k^{max} - r_k^{min})^{+} / \phi_k$ (where $x^{+}$ is the
maximum of 0 and x). The final rates given to the connections
are

$$r_k = r_k^{min} + b_k^{c}\,\phi_k \qquad (43)$$

and

$$r_k^{PCR} = r_k^{min} + b_k^{PCR}\,\phi_k. \qquad (44)$$
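
The adjustments of this extension can be sketched as two small
helpers: one that prepares the reduced capacities and the
$b_k^{max}$ values before the process runs, and one that adds
the minimum rates back afterwards. The helper names and data
layout are assumptions made for illustration.

```python
from fractions import Fraction

def prepare_min_rates(capacity, paths, weights, bw_max, r_min):
    """Return (reduced capacities per eq. (42), b_k^max values)."""
    reduced = {}
    for j in capacity:
        reserved = sum(r_min[k] for k in paths if j in paths[k])
        if reserved >= capacity[j]:
            raise ValueError("minimum rates must sum to less than capacity")
        reduced[j] = capacity[j] - reserved
    b_max = {k: max(Fraction(0), Fraction(bw_max[k]) - r_min[k]) / weights[k]
             for k in paths}
    return reduced, b_max

def final_rates(r_min, weights, b_cur, b_pcr):
    """Apply eqs. (43) and (44) to the expansion factors."""
    rate = {k: r_min[k] + b_cur[k] * weights[k] for k in r_min}
    pcr = {k: r_min[k] + b_pcr[k] * weights[k] for k in r_min}
    return rate, pcr

paths = {"X": ("A", "I"), "Y": ("A", "I")}
r_min = {"X": Fraction(1, 4), "Y": Fraction(0)}
reduced, b_max = prepare_min_rates({"A": 1, "I": 1}, paths,
                                   {"X": 1, "Y": 2},
                                   {"X": Fraction(1, 2), "Y": Fraction(1, 10)},
                                   r_min)
print(reduced, b_max)
```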



Representative computer code to implement process A is 
shown in the attached Appendix. 

6.5 Rate Assignment: Process B

The rate assignment processes described above have been
optimized in terms of fairness and, to a lesser extent,
throughput. Process B is a GPS fair process, which requires
relatively fewer calculations than those described above.

Process B assigns rates to flows passing through one or
more bottleneck points. Specifically, process B is used when
the following general conditions apply:



• Bandwidth is to be assigned to all of the connections
at the same time.

• The bandwidth is to be used for a specific period of
time known as the "scheduling interval". At the end
of this interval, all of the connections are to
receive new bandwidth assignments.

• The bandwidth is to be distributed in an EGPS fair
manner.

• Connections may have maximums placed on their
bandwidth allocations.


Several variations on the general process are presented
based on whether a single node is scheduling among several
connections or whether a central location is scheduling
input/output pairs. In this context, "connection" means
VCs, VPs, or any of several varieties of VLs. In
addition, "node" means some bottleneck point where cells must
be scheduled. These might be input and output switch core
links or they might be some other locations in the switch.

The basic idea of process B is to pass out bandwidth in 

25 a distributed weighted round-robin manner. Since switches 
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can only send whole cells, fluid flow models, which tell us 
to serve all of the head-of-line cells simultaneously at 
fractions of the link rate, are approximations. Since a 
single cell is the smallest unit a switch can actually 
5 transmit, then the smallest unit of data to consider is a 

single cell. In addition, a natural unit of time is the time 
it takes to send a single cell. 

Consider the number of cell departure opportunities in a 
scheduling interval. Call this number Max_Slots. Assigning 

10 bandwidth to flows for use over the scheduling interval is 
equivalent to awarding connections departure opportunities 
(slots) so that the total number of slots awarded is 
Max_Slots. It is possible to convert between slots and 
bandwidth using the number of bits in a cell and the overall 

15 link bandwidth. 
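The conversion between slots and bandwidth mentioned above can
be sketched as follows. This is a minimal illustration, not
part of the described process: the function names are
hypothetical, and the 53-byte (424-bit) cell size is an
assumption based on the ATM terminology used in this document.

```python
ATM_CELL_BITS = 53 * 8  # assumed cell size: a 53-byte ATM cell


def interval_seconds(max_slots, link_bps, cell_bits=ATM_CELL_BITS):
    """Duration of a scheduling interval of max_slots cell times."""
    return max_slots * cell_bits / link_bps


def slots_to_bandwidth(slots, max_slots, link_bps):
    """Bandwidth in bits/s represented by `slots` out of `max_slots`."""
    return link_bps * slots / max_slots


def bandwidth_to_slots(bandwidth_bps, max_slots, link_bps):
    """Whole slots covered by `bandwidth_bps`, rounded down."""
    return int(bandwidth_bps * max_slots // link_bps)
```

A connection awarded 5 of 20 slots on a 1 Gbit/s link would
hold one quarter of the link rate for the interval.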

In order to award cell slots in a fair manner, a
calendar with a length equal to Max_Slots can be constructed.
If each connection has a weight φ_i, such that the
φ_i sum to 1 or less, then each connection will receive
φ_i · Max_Slots (rounded to an integer) entries in the
calendar. These entries are placed in the calendar so that
each connection is as evenly distributed as possible, e.g.,
one entry per calendar slot. Each calendar slot is referred
to as a "day", though this is a reference unit only and does
not mean a 24 hour period. In addition, any empty calendar
slots are distributed evenly. By advancing day-by-day
through the calendar, and awarding slots to whichever
connection occupies each day, it is possible to achieve a
relatively fair distribution of bandwidth. A weighted
round-robin linked list can be used instead of a calendar.
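The text does not prescribe how the entries are spread out.
One possible construction, shown here only as a sketch, gives
every entry an ideal fractional position and sorts on it. The
function name build_calendar, the use of None for an empty
day, and the midpoint spacing rule are all assumptions made
for illustration.

```python
def build_calendar(weights, max_slots):
    """Return a list of max_slots days; each day holds a connection or None."""
    entries = {x: round(w * max_slots) for x, w in weights.items()}
    used = sum(entries.values())
    if used > max_slots:
        raise ValueError("rounded entries exceed the calendar length")
    # Treat the empty days as one more "connection" so they spread out too.
    counts = list(entries.items()) + [(None, max_slots - used)]
    placed = []
    for order, (conn, n) in enumerate(counts):
        for k in range(n):
            # Ideal position of the k-th of n entries, as a fraction of the calendar.
            placed.append(((k + 0.5) / n, order, conn))
    placed.sort(key=lambda item: (item[0], item[1]))
    return [conn for _, _, conn in placed]
```

Because the sort key includes the connection's order, ties
between connections are broken deterministically. The result
is not guaranteed to match any particular hand-built calendar.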



6.5.1 Single Node Process B

First consider an isolated node distributing bandwidth
among several different connections. A total of Max_Slots
slots are to be distributed among the connections. Each
connection x will accept a maximum of Max(x) slots. The
following quantities are used in the single node process B.

Max(x): maximum number of slots that connection x will
    accept (proportional to BW^max).
Slots(x): number of slots that have been assigned to
    connection x. Starts at 0 and increases to at
    most Max(x).
Total: total number of slots that have been awarded.
    Starts at 0 and increases to Max_Slots.
Day: next slot in calendar to be considered.
Calendar_Entry(k): entry in the calendar for day k.
Calendar_Length: length of the calendar. Usually
    equal to Max_Slots.

The process is as follows. It is assumed that the
calendar has already been constructed and the variable "Day"
already has some value. Unless this is the first run of the
process, the value should be left from the last interval.

Step 1:
    Calculate Max(x) for each connection
    Slots(x)=0 for each connection
    Total=0

Step 2:
    IF ΣMax(x)<Max_Slots (sum over all connections)
    THEN Slots(x)=Max(x) for all connections, END

    This step prevents the process from entering an infinite
    loop. Without this step, there is a possibility that
    all of the connections could reach their Max values
    before Total reached Max_Slots.

Step 3:
    Flow=Calendar_Entry(Day); Flow equals the connection
    that occupies the current day

    IF Slots(Flow)<Max(Flow) AND Flow≠"Empty"
    THEN Slots(Flow)=Slots(Flow)+1, Total=Total+1

    If Slots(Flow)=Max(Flow) or the calendar slot is empty,
    then do nothing and move on to the next day in the
    calendar.

Step 4:
    Day=(Day+1) MOD Calendar_Length

    Increment the current day but wrap around to day 1 when
    the end of the calendar is reached.

Step 5:
    IF Total=Max_Slots
    THEN END
    ELSE GOTO step 3
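Steps 1 through 5 can be sketched in Python as follows. This
is an illustrative sketch under stated assumptions, not the
representative code of the Appendix: the function name, the
dictionary-based data layout, and the use of None for an empty
day are hypothetical. Days are numbered from 1 and wrap from
the last day back to day 1.

```python
def single_node_process_b(calendar, maxima, max_slots, day):
    """Award max_slots slots by walking the calendar from `day` (1-based).

    calendar: list whose k-th item is the connection on day k+1, or None.
    maxima:   dict mapping each connection x to Max(x).
    Returns (slots, day), where day is the starting day of the next interval.
    """
    # Step 1: nothing has been awarded yet.
    slots = {x: 0 for x in maxima}
    total = 0
    # Step 2: if every connection can be fully satisfied, do so and stop.
    if sum(maxima.values()) < max_slots:
        return dict(maxima), day
    # Steps 3 to 5: advance day by day until all slots are awarded.
    while total < max_slots:
        flow = calendar[day - 1]
        if flow is not None and slots[flow] < maxima[flow]:
            slots[flow] += 1
            total += 1
        day = day % len(calendar) + 1  # wrap from the last day to day 1
    return slots, day
```

The returned day should be kept and passed to the next run, so
that connections short-changed by rounding in one interval are
served first in the next.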

6.5.2 Single Node Process Example
This example shows how the single node process B of
section 6.5.1 works in practice.

Weights: φ_A=0.4, φ_B=0.2, φ_C=0.1, and φ_D=0.1. Note that
the weights sum to 0.8, which is less than 1.
Scheduling interval: 20 cells.
Calendar slots: A=8, B=4, C=2, and D=2.
Maximums: Max(A)=3 cell slots, Max(B)=20, Max(C)=10,
and Max(D)=3.

Calendar (a dash marks an empty day):

Day:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Entry:  A  B  C  A  -  A  B  D  A  -  A  B  C  A  -  A  B  D  A  -
Day pointer: Day 6.
Process steps:

0) Slots(A)=Slots(B)=Slots(C)=Slots(D)=Total=0
1) Day=6, Slots(A)=1, Total=1
2) Day=7, Slots(B)=1, Total=2
3) Day=8, Slots(D)=1, Total=3
4) Day=9, Slots(A)=2, Total=4
5) Day=10, empty
6) Day=11, Slots(A)=3, Total=5
7) Day=12, Slots(B)=2, Total=6
8) Day=13, Slots(C)=1, Total=7
9) Day=14, Slots(A)=Max(A) so slot refused
10) Day=15, empty
11) Day=16, Slots(A)=Max(A) so slot refused
12) Day=17, Slots(B)=3, Total=8
13) Day=18, Slots(D)=2, Total=9
14) Day=19, Slots(A)=Max(A) so slot refused
15) Day=20, empty
16) Day=1, Slots(A)=Max(A) so slot refused
17) Day=2, Slots(B)=4, Total=10
18) Day=3, Slots(C)=2, Total=11
19) Day=4, Slots(A)=Max(A) so slot refused
20) Day=5, empty
21) Day=6, Slots(A)=Max(A) so slot refused
22) Day=7, Slots(B)=5, Total=12
23) Day=8, Slots(D)=3, Total=13
24) Day=9, Slots(A)=Max(A) so slot refused
25) Day=10, empty
26) Day=11, Slots(A)=Max(A) so slot refused
27) Day=12, Slots(B)=6, Total=14
28) Day=13, Slots(C)=3, Total=15
29) Day=14, Slots(A)=Max(A) so slot refused
30) Day=15, empty
31) Day=16, Slots(A)=Max(A) so slot refused
32) Day=17, Slots(B)=7, Total=16
33) Day=18, Slots(D)=Max(D) so slot refused
34) Day=19, Slots(A)=Max(A) so slot refused
35) Day=20, empty
36) Day=1, Slots(A)=Max(A) so slot refused
37) Day=2, Slots(B)=8, Total=17
38) Day=3, Slots(C)=4, Total=18
39) Day=4, Slots(A)=Max(A) so slot refused
40) Day=5, empty
41) Day=6, Slots(A)=Max(A) so slot refused
42) Day=7, Slots(B)=9, Total=19
43) Day=8, Slots(D)=Max(D) so slot refused
44) Day=9, Slots(A)=Max(A) so slot refused
45) Day=10, empty
46) Day=11, Slots(A)=Max(A) so slot refused
47) Day=12, Slots(B)=10, Total=20=Max_Slots



Final totals: Slots(A)=3, Slots(B)=10, Slots(C)=4, and
Slots(D)=3.

Day pointer (starting point of next interval): Day 13.

Ideally, using EGPS and assuming a fluid flow model,
connection A would receive 3 cell slots, connection B would
receive 9.33 cell slots, connection C would receive 4.67
cell slots, and connection D would receive 3 cell slots. At
first, it would appear that it would be closest to ideal if
connection C had received 5 slots and connection B had
received 9 slots (4.67 rounds to 5, 9.33 rounds to 9). This
should only happen two out of every three times, however. B
should receive 10 and C receive 4 one out of every three
times. The Day pointer for the next interval is pointing to
Day 13, a C day. This ensures that C will receive 5 slots in
the next interval (in the next interval Slots(B)=9,
Slots(C)=5).
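The ideal fluid values quoted above can be checked by hand.
The working below assumes that EGPS behaves as weighted
sharing in which a connection that would exceed its maximum is
capped and its excess is re-divided among the rest; the
symbols s_A through s_D for the fluid shares are introduced
here only for illustration.

```latex
\begin{align*}
\text{A uncapped: } & \tfrac{0.4}{0.8}\cdot 20 = 10 > \mathrm{Max}(A)=3
   \;\Rightarrow\; s_A = 3,\ \text{leaving } 17 \\
\text{D uncapped: } & \tfrac{0.1}{0.4}\cdot 17 = 4.25 > \mathrm{Max}(D)=3
   \;\Rightarrow\; s_D = 3,\ \text{leaving } 14 \\
s_B &= \tfrac{0.2}{0.3}\cdot 14 \approx 9.33 \le \mathrm{Max}(B)=20 \\
s_C &= \tfrac{0.1}{0.3}\cdot 14 \approx 4.67 \le \mathrm{Max}(C)=10
\end{align*}
```

Over three intervals B should therefore average 28/3 slots and
C should average 14/3, which the pattern of two intervals at
(9, 5) and one at (10, 4) achieves exactly.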

6.5.3 Single Node Process: Modification 1
This is the first of two modifications that may reduce
the complexity of the single node process B of section 6.5.1.
This modification stems from the fact that the process will
make it around the calendar at least once, assuming
Calendar_Length=Max_Slots. If Max_Slots is a multiple of
Calendar_Length, the process might make it around several
times. This makes it possible to skip the first trip around
the calendar and assign the connections the values they would
have received. Thus, taking Calendar_Slots(x) to be the
number of entries that connection x has in the calendar,
insert this step into the single node process B of section
6.5.1 between steps 2 and 3:






Docket No.: 06269/020001 

Step 2.5:
    Slots(x)=min{Calendar_Slots(x), Max(x)} for each
    connection
    Total=ΣSlots(x) (sum over all connections)

In the example, the process could be "jump-started" with
Slots(A)=3, Slots(B)=4, Slots(C)=2, Slots(D)=2, and Total=11.
Day would remain at 6. The process would then continue
starting at step 2.
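The jump-start of step 2.5 is a one-line computation per
connection. A minimal sketch, with a hypothetical function
name and dictionary layout:

```python
def jump_start(calendar_slots, maxima):
    """Credit each connection with one full trip around the calendar.

    calendar_slots: dict mapping connection x to Calendar_Slots(x).
    maxima:         dict mapping connection x to Max(x).
    Returns (slots, total) as they would stand after the first trip.
    """
    slots = {x: min(calendar_slots[x], maxima[x]) for x in maxima}
    return slots, sum(slots.values())
```

With the example's values this gives the same state that the
step-by-step walk reaches after its first 20 days.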

6.5.4 Single Node Process: Modification 2
This modification checks the possibility that the
process might get around the calendar several times before
Total hits Max_Slots. As long as Total stays less than
Max_Slots, keep adding Calendar_Slots(x) to Slots(x). This
alternative step 2.5 shows how this modification works:

Step 2.5:
    Temp_Slots(x)=min{Slots(x)+Calendar_Slots(x), Max(x)}
    for each connection
    Temp_Total=ΣTemp_Slots(x) (sum over all connections)
    IF Temp_Total<Max_Slots
    THEN Slots(x)=Temp_Slots(x) for all connections, GOTO
    Step 2.5

    Total=ΣSlots(x) (sum over all connections)

This modification would be most useful for sparse
calendars, or when many of the connections have low maximum
bandwidths. In both of these cases, many passes through the
calendar would be necessary to reach Max_Slots. In the
previous example, this step 2.5 would run three times (in the
last run Temp_Total would exceed Max_Slots so Temp_Slots
would be discarded) and then skip to step 4.
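The repeated step 2.5 can be sketched as a loop. The function
name and the returned pass count are hypothetical, and the
extra check that a pass actually changed something is an added
safeguard that is not in the text; it guards against looping
forever when a connection has no calendar entries.

```python
def fast_forward(calendar_slots, maxima, max_slots):
    """Credit whole trips around the calendar while Total stays below max_slots.

    Returns (slots, total, passes), where passes counts how many times
    step 2.5 ran, including the final run whose result is discarded.
    """
    slots = {x: 0 for x in maxima}
    passes = 0
    while True:
        passes += 1
        temp = {x: min(slots[x] + calendar_slots[x], maxima[x]) for x in maxima}
        if sum(temp.values()) < max_slots and temp != slots:
            slots = temp  # the whole trip fits, so keep it and try another
        else:
            break  # this trip would reach max_slots (or changes nothing)
    return slots, sum(slots.values()), passes
```

The remaining slots are then awarded one day at a time by the
unmodified steps of section 6.5.1.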

6.5.5 Multiple Node Process B
When multiple input and output nodes are being jointly
and simultaneously assigned rates, a different variation of
process B is needed. The key fact here is that there is not
a single variable Total but rather a variable Total(y) for
each node. No connection passing through node y can accept
additional slots if Total(y)=Max_Slots. In fact, each node
may have a different Max_Slots value, Max_Slots(y). Each
connection is associated with both an input and an output
node. Thus, the call to Calendar_Entry(Day) will return two
pieces of information, Flow_From, the input node a connection
passes through, and Flow_To, the output node a connection
passes through.

In order for a connection to accept a slot, three things
should be true. The input node a connection passes through
must not be saturated, the output node a connection passes
through must not be saturated, and the connection must be
below its maximum bandwidth. These conditions are equivalent
to Total(Flow_From)<Max_Slots(Flow_From),
Total(Flow_To)<Max_Slots(Flow_To), and Slots(x)<Max(x). If
all of these are true, then the slot is accepted and
Total(Flow_From), Total(Flow_To), and Slots(x) are all
incremented.

Stopping the process is more complex, since there is no
single Total value to compare with Max_Slots. The process
should not stop when the first Total reaches its Max_Slots
value and it may need to stop before all of the Total values
reach their Max_Slots value. There are several possible
stopping rules. First, as in process A, it is possible to
keep a set of active connections. When a connection reaches
its Max value, it is removed from this set. If a node
reaches its Max_Slots value, all of the connections passing
through it are removed from the set. As long as this set is
non-empty, there is at least one connection accepting slots
and the process should continue. When the active connection
set is empty, the process should stop.

A second stopping rule involves keeping track of when
the last slot was accepted. If the process makes an entire
pass through the calendar without a slot being accepted, then
the process should stop. Implementing this stopping rule
means defining a variable Last_Day, which is set to the
current day whenever a slot is accepted by some connection.
When Day=Last_Day and the connection occupying Day is not
accepting slots, then the process is stopped.
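The multiple node variation with the second stopping rule can
be sketched as follows. This is an illustration under stated
assumptions: the function name and data layout are
hypothetical, each connection is identified by its
(Flow_From, Flow_To) pair, and the rule is implemented by
counting consecutive days without an acceptance, which is
equivalent to comparing Day with Last_Day.

```python
def multi_node_process_b(calendar, max_conn, max_in, max_out, day=0):
    """Award slots over a unified calendar of (flow_from, flow_to) entries.

    calendar: list of (flow_from, flow_to) tuples, or None for an empty day.
    max_conn: dict mapping each (flow_from, flow_to) pair to Max(x).
    max_in, max_out: dicts mapping each node y to Max_Slots(y).
    Returns (slots, day) with day as a 0-based index for the next run.
    """
    slots = {conn: 0 for conn in max_conn}
    total_in = {n: 0 for n in max_in}
    total_out = {n: 0 for n in max_out}
    idle_days = 0  # consecutive days on which no slot was accepted
    while idle_days < len(calendar):
        entry = calendar[day]
        accepted = False
        if entry is not None:
            src, dst = entry
            if (total_in[src] < max_in[src]
                    and total_out[dst] < max_out[dst]
                    and slots[entry] < max_conn[entry]):
                slots[entry] += 1
                total_in[src] += 1
                total_out[dst] += 1
                accepted = True
        idle_days = 0 if accepted else idle_days + 1
        day = (day + 1) % len(calendar)
    return slots, day
```

The final fruitless pass costs one full trip around the
calendar. The active-set rule avoids that trip at the price of
maintaining the set.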

There are several ways of constructing the calendar in
the multi-node case. One method is to use one large, unified
calendar, which contains all of the connections. Each entry
has an input and an output associated with it. Note that
this calendar is longer than Max_Slots. If there were N
input nodes and N output nodes and each node had a capacity
of Max_Slots, then the calendar would most likely be
N·Max_Slots long. Another method uses a separate calendar
for each input or output node. Entries in each calendar
represent output or input nodes. Days from each of the
calendars are examined in a round-robin fashion. A day is
examined from calendar 1, then a day from calendar 2, etc.

6.5.6 Multiple Node Process Modifications
If there are open slots in the calendar, then new, or
expanding old, connections can claim them. This is one
reason to distribute any empty slots evenly. That way, if
any connections are added, they can be distributed evenly
as well. After adding and removing enough connections, the
distribution of the slots may become uneven. When the
unevenness exceeds a predetermined threshold, a new calendar
may be constructed.

Constructing a new calendar can be done during the time
between process runs. Still, reconstructing the calendar
should be avoided if possible, since the fairness of the
process increases as more runs are performed using the same
calendar.

The next issue involves two kinds of minimum bandwidths.
The first is the smallest bandwidth a connection may receive.
At this point, this minimum bandwidth is 1 slot per
scheduling interval. Increasing the scheduling interval will
cause this minimum bandwidth to decrease, but the ability of
the process to react to bursty traffic will suffer as this
interval grows. The second minimum bandwidth involves
guaranteeing a certain number of slots to certain
connections. This feature can be added by inserting a new
step at the beginning of the process. This step would set
Slots(x)=Min(x) for each connection (Min(x) being the minimum
number of slots for connection x). Total would then be
updated.
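The guaranteed-minimum step can be sketched as follows for the
single-Total case. The function name is hypothetical, and the
check that the minimums fit within Max_Slots is an added
assumption, since the text does not say what should happen
otherwise.

```python
def preassign_minimums(minimum_slots, max_slots):
    """Start every connection at Min(x) and return (slots, total)."""
    slots = dict(minimum_slots)
    total = sum(slots.values())
    if total > max_slots:
        # Assumption: guaranteed minimums must not oversubscribe the node.
        raise ValueError("guaranteed minimums exceed Max_Slots")
    return slots, total
```

In the multiple node case the same idea applies, with each
Total(y) increased by the minimums of the connections passing
through node y.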

Finally, there is the issue of PCR rates, or bandwidth
above what a connection or node expects to be able to handle.
After the multi-node process has stopped, it would be a
simple matter to restart the process with the saturated
output node [Total(Flow_To)=Max_Slots(Flow_To)] being the
only reason a connection would not accept a slot.

6.6 Rate Assignment: Process C

The foregoing rate assignment processes depend on a
central scheduler making global rate calculations for all
input and output ports. This may not be feasible in all
cases. There is a need for processes that can make rate
calculations in a distributed manner. Process C makes such
calculations. Process C attempts to be GPS fair to all of
the connections, but since no central scheduler oversees the
rate assignment process, there may be situations where
inequities occur.

Process C includes two phases, a phase at the input
nodes and a phase at the output nodes. In the first phase
(the input phase), each input node independently determines
the bandwidth (the rate) it would give each of its
connections if it were the only bottleneck in the system.
These rates depend on an estimate of the amount of bandwidth
each connection would use if it were to receive all of the
bandwidth it desired and a set of fixed weights for the
connections. These input node rates and maximum bandwidth
estimates are forwarded on to the appropriate output nodes.

In the second phase of process C, each output node
determines a rate for the connections passing through it.
These rates are independent of the rates the other output
nodes calculate. These rates depend on the maximum bandwidth
estimates and the input node bandwidth estimates, however.

Within a node, the rates are determined based on a set
of fixed weights and a set of thresholds. The weights, φ_i,
are used in the normal GPS way, as proportionality constants
when dividing up bandwidth in a weighted fair manner. The
thresholds form a series of plateaus for the bandwidth of
each connection. Each connection x passing through a node
has a series of thresholds T1(x), T2(x), etc. Bandwidth is
distributed to a connection in a weighted fair fashion (GPS
fair way, proportional to the weights) until the connection
reaches the next threshold. When a connection reaches a
threshold, say T2(x), it is temporarily frozen and the
connection stops accumulating bandwidth. When all of the
other connections reach their T2 thresholds, then all of the
connections can begin accumulating bandwidth again.

Thus, all of the connections start with zero bandwidth
and build up to their T1 threshold. None of the connections
advance beyond their T1 threshold until all of the
connections have reached their T1 threshold. All of the
connections then advance towards their T2 values. After all
of the connections have reached their T2 thresholds, the
connections advance towards their T3 thresholds, and so on.

6.6.1 Input Node Phase
As in processes A and B, process C determines the
estimated bandwidth that each connection would use if it
could. This value, Max(x) for a connection x, is based on
the queue length and recent cell arrival pattern of each
connection. The bandwidth is split up between connections in
a weighted fair manner with thresholds. The thresholds for
the Input Node Phase are as follows:

    T1(x) = Min(x)
    T2(x) = Max(x)
    T3(x) = C

where Min(x) is the minimum bandwidth for connection x. The
sum of the minimum bandwidths is less than the capacity of
the node, which is value C for all of the nodes. Note that
it is possible for a connection to receive a rate greater
than Max(x), which occurs when ΣMax(x)<C. In pseudo-code
this phase can be written as follows:

    IF ΣT1(y)>C
    THEN BW_In(x)=0 for all x,
         Distribute(BW=C, Threshold=T1)
         {See note 1 below}
    ELSE IF ΣT2(y)>C
    THEN BW_In(x)=T1(x) for all x,
         Distribute(BW=(C-ΣT1(y)), Threshold=T2)
         {See note 2 below}
    ELSE BW_In(x)=T2(x) for all x,
         Distribute(BW=(C-ΣT2(y)), Threshold=T3)
         {See note 3 below}

BW_In(x) is the input node's estimate of the bandwidth
available for connection x. After BW_In(x) has been
determined, it is sent, along with Max(x), to the output node
that connection x passes through.

Note 1: In this case, there is not enough bandwidth
available to give every connection an amount of bandwidth
equal to its T1 threshold. So, all connections
start with no bandwidth and the Distribute subroutine divides
the bandwidth C among the connections in a fair manner. The
Distribute subroutine will not award any connection more
bandwidth than its T1 value, so any given connection "x" will
end up with an amount of bandwidth between 0 and T1(x).

Note 2: In this case, there is enough bandwidth to give
every connection at least an amount of bandwidth equal to its
T1 threshold, but not enough to give every connection an
amount of bandwidth equal to its T2 threshold. Every
connection starts with an amount of bandwidth equal to its T1
threshold and the Distribute subroutine passes out the
remaining bandwidth, (C-ΣT1(y)), in a fair manner. No
connection will end up with bandwidth exceeding its T2
threshold. Any given connection "x" will end up with an
amount of bandwidth between T1(x) and T2(x).

Note 3: In this case, there is enough bandwidth to give
every connection at least an amount of bandwidth equal to its
T2 threshold, but not enough to give every connection an
amount of bandwidth equal to its T3 threshold. Every connection
starts with an amount of bandwidth equal to its T2 threshold
and the Distribute subroutine passes out the remaining
bandwidth, (C-ΣT2(y)), in a fair manner. No connection will
end up with bandwidth exceeding its T3 threshold. Any given
connection "x" will end up with an amount of bandwidth
between T2(x) and T3(x).

The subroutine Distribute(BW=R, Threshold=Tj)
distributes an amount of bandwidth equal to R in a weighted
fair manner, allowing no connection to exceed its Tj
threshold. The subroutine starts with whatever the current
BW_In values are and adds bandwidth to them. If needed, the
actual bandwidth calculations can be done in some low
complexity manner, such as process B.
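The input node phase and its Distribute subroutine can be
sketched as follows. The text leaves the internals of
Distribute open, so the fluid (water-filling) calculation used
here is one possible choice, assumed for illustration; the
function names, the dictionary layout, and the tolerance EPS
are likewise hypothetical. Weights are assumed to be positive.

```python
EPS = 1e-12  # tolerance for floating-point comparisons (an assumption)


def distribute(alloc, weights, threshold, amount):
    """Add `amount` to alloc in weighted fair shares, capping x at threshold[x]."""
    alloc = dict(alloc)
    active = {x for x in alloc if threshold[x] - alloc[x] > EPS}
    while active and amount > EPS:
        total_w = sum(weights[x] for x in active)
        # Largest common fill step before a cap is hit or bandwidth runs out.
        step = min(min((threshold[x] - alloc[x]) / weights[x] for x in active),
                   amount / total_w)
        for x in active:
            alloc[x] += weights[x] * step
        amount -= total_w * step
        active = {x for x in active if threshold[x] - alloc[x] > EPS}
    return alloc


def input_node_phase(min_bw, max_bw, weights, capacity):
    """Return BW_In(x) using thresholds T1=Min(x), T2=Max(x), T3=C."""
    t1, t2 = dict(min_bw), dict(max_bw)
    t3 = {x: capacity for x in min_bw}
    if sum(t1.values()) > capacity:          # note 1
        start = {x: 0.0 for x in min_bw}
        return distribute(start, weights, t1, capacity)
    if sum(t2.values()) > capacity:          # note 2
        return distribute(t1, weights, t2, capacity - sum(t1.values()))
    return distribute(t2, weights, t3, capacity - sum(t2.values()))  # note 3
```

In the note 3 case a connection can end up above Max(x), which
matches the remark that this occurs when ΣMax(x)<C.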
6.6.2 Output Node Phase
At the output node, all of the input node bandwidth
estimates and the connection bandwidth need estimates are
collected. The output node determines bandwidth allotments
for each connection based on the following thresholds:

    T1(x) = Min(x)
    T2(x) = Min{BW_In(x), Max(x)}
    T3(x) = Max(x)
    T4(x) = Max{BW_In(x), Max(x)}
    T5(x) = C

In pseudo-code, BW_Out(x) can be determined as follows.

    IF ΣT1(y)≥C
    THEN BW_Out(x)=0 for all x,
         Distribute(BW=C, Threshold=T1)
         {See note 1 below}
    ELSE IF ΣT2(y)>C
    THEN BW_Out(x)=T1(x) for all x,
         Distribute(BW=(C-ΣT1(y)), Threshold=T2)
         {See note 2 below}
    ELSE IF ΣT3(y)>C
    THEN BW_Out(x)=T2(x) for all x,
         Distribute(BW=(C-ΣT2(y)), Threshold=T3)
         {See note 3 below}
    ELSE IF ΣT4(y)>C
    THEN BW_Out(x)=T3(x) for all x,
         Distribute(BW=(C-ΣT3(y)), Threshold=T4)
         {See note 4 below}
    ELSE BW_Out(x)=T4(x) for all x,
         Distribute(BW=(C-ΣT4(y)), Threshold=T5)
         {See note 5 below}

BW_Out(x) is then sent back to the appropriate input.
The inputs should be careful not to allow any connection to
exceed its BW_Out rate.

Note 1: In this case, there is not enough bandwidth
available to give every connection an amount of bandwidth
equal to its T1 threshold. So, all connections start with no
bandwidth and the Distribute subroutine divides the bandwidth
C among the connections in a fair manner. The Distribute
subroutine will not award any connection more bandwidth than
its T1 value. Therefore, any given connection "x" will end
up with an amount of bandwidth between 0 and T1(x).

Note 2: In this case, there is enough bandwidth to give
every connection at least an amount of bandwidth that is
equal to its T1 threshold, but not enough to give every
connection an amount of bandwidth equal to its T2 threshold.
Every connection starts with an amount of bandwidth equal to
its T1 threshold and the Distribute subroutine passes out the
remaining bandwidth, (C-ΣT1(y)), in a fair manner. No
connection will end up with bandwidth exceeding its T2
threshold. Any given connection "x" will end up with an
amount of bandwidth between T1(x) and T2(x).

Note 3: In this case, there is enough bandwidth to give
every connection at least an amount of bandwidth equal to its
T2 threshold, but not enough to give every connection an
amount of bandwidth equal to its T3 threshold. Every connection
starts with an amount of bandwidth equal to its T2 threshold
and the Distribute subroutine passes out the remaining
bandwidth, (C-ΣT2(y)), in a fair manner. No connection will
end up with bandwidth exceeding its T3 threshold. Any given
connection "x" will end up with an amount of bandwidth
between T2(x) and T3(x).

Note 4: In this case, there is enough bandwidth to give
every connection at least an amount of bandwidth equal to its
T3 threshold, but not enough to give every connection an
amount of bandwidth equal to its T4 threshold. Every connection
starts with an amount of bandwidth equal to its T3 threshold
and the Distribute subroutine passes out the remaining
bandwidth, (C-ΣT3(y)), in a fair manner. No connection will
end up with bandwidth exceeding its T4 threshold. Any given
connection "x" will end up with an amount of bandwidth
between T3(x) and T4(x).

Note 5: In this case, there is enough bandwidth to give
every connection at least an amount of bandwidth equal to its
T4 threshold, but not enough to give every connection an
amount of bandwidth equal to its T5 threshold. Every connection
starts with an amount of bandwidth equal to its T4 threshold
and the Distribute subroutine passes out the remaining
bandwidth, (C-ΣT4(y)), in a fair manner. No connection will
end up with bandwidth exceeding its T5 threshold so any given
connection x will end up with an amount of bandwidth between
T4(x) and T5(x).

As in the input node, it is not expected that
ΣMin(y)>C. T2 makes sure that bandwidth is going to
connections that both think they need it (as represented by
the Max(x) term) and think their input node will let them use
it (as represented by the BW_In(x) term). After this second
threshold, the bandwidth the output node is passing out may
not be used by connections because either they do not need it
(BW_Out(x)>Max(x)) or their input nodes will not allow them
to use it (BW_Out(x)>BW_In(x)).

T3 is a value judgment. It is more likely that the
BW_In(x) value will be below the actual bandwidth the input
node could give a connection than the Max(x) value will be
below the bandwidth a connection could use. BW_In is often
low because some connections will have BW_Out(x)<BW_In(x).
This will free up some of the bandwidth that has been
tentatively set aside for these connections. This freed up
bandwidth can be claimed by connections with
BW_Out(x)>BW_In(x). It would, of course, be possible to make
T3=BW_In(x) or even remove T3 entirely.
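The five output node thresholds and the choice among the five
cases can be sketched as follows. The function names and the
returned triple are hypothetical. The sketch tests every level
with a strict comparison; at exact equality both readings lead
to the same final allotment, since the connections simply end
at that threshold.

```python
def output_thresholds(min_bw, max_bw, bw_in, capacity):
    """Return the list [T1, T2, T3, T4, T5] of per-connection thresholds."""
    return [
        dict(min_bw),                                      # T1 = Min(x)
        {x: min(bw_in[x], max_bw[x]) for x in min_bw},     # T2
        dict(max_bw),                                      # T3 = Max(x)
        {x: max(bw_in[x], max_bw[x]) for x in min_bw},     # T4
        {x: capacity for x in min_bw},                     # T5 = C
    ]


def select_stage(thresholds, capacity):
    """Pick the case that applies.

    Returns (start, cap, leftover): the starting BW_Out values, the
    threshold that Distribute must not exceed, and the bandwidth that
    Distribute still has to hand out.
    """
    previous = {x: 0.0 for x in thresholds[0]}
    for level in thresholds:
        if sum(level.values()) > capacity:
            return previous, level, capacity - sum(previous.values())
        previous = level
    # Every threshold fits within the capacity, so nothing is left to cap.
    return previous, previous, capacity - sum(previous.values())
```

The triple would then be handed to whatever Distribute
implementation the node uses to produce the final BW_Out(x).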

6.6.3 Process C Modifications
Each of the nodes does not have to have capacity C. If
node z has capacity C_z, it would be a simple change to use
the node specific values in the process. It is possible
to use different sets of weights between each pair of
thresholds. This would require each connection to have
several weights but it certainly could be done.

6.7 Rate Assignment: Process C+

Process C+ is related to process C, as described below.
A distributed hardware architecture is also described below
for implementing processes C and/or C+.

Process C+ approximates the bandwidth allocations of VLs
(or connections or sets of connections) produced by process A
with a lower computational complexity. Process C+ is
designed around the assumptions that the VL rates are updated
synchronously, the update period is fixed, and there is a
central location where all, or most, of the calculations take
place.

Process C+ determines the PCRs for all of the VLs using
several steps. First, each VL creates an estimate of how
many cells it would like to send in the next update interval.
These rates are known as the desired rates, r^des. Second,
these rates are sent to a central location where most of the
rate calculations take place. Third, using the desired
rates, the available bandwidth of each input port is
distributed fairly among the VLs passing through the port, in
effect creating the requested rates, r^req. A single parameter
per input port, the requested rate factor, x*, is found. The
requested rates can be determined from this quantity.
Fourth, using the desired and requested rates, the bandwidth
of each output port is distributed fairly among the VLs,
creating the granted rates, r^gnt. Fifth, the granted rates
are sent from the central location to the input ports.

Fig. 21 shows a block diagram for a process C+ Central
Rate Processing Unit (CRPU), where most of process C+ may be
performed. The notation will be explained in the sections
that follow. The asterisk is used in Fig. 21 to mean "all."
Thus, r^des(i,*) symbolizes the desired rates for VLs from
input port i to all of the output ports. This usage should
not be confused with the asterisk in x* below. Information in
Fig. 21 flows from left to right. At the far left, the desired
rates arrive from the input ports, where each input port
determines a desired rate for each VL that originates at that
port. This information enters the Input Port Rate Processing
Modules (IPRPMs) 100, 101, each of which includes a
Thresholding block 106 and a Distribute block 107. The
IPRPMs, one for each port, find the rate factor values, x*,
that define the requested rates. These x* values and the
desired rates are sent to the appropriate Output Port Rate
Processing Modules (OPRPMs) 109, 110.

The OPRPMs (one for each output port) include
Thresholding/Calculation blocks 111 and Distribute block
pairs 112. These blocks use the desired rates and the
requested rates, which they calculate from the requested rate
factors, to calculate their own rate factor values, which
they then convert to granted rates. The granted rates are
grouped appropriately and sent to the input ports.

This particular architecture features a degree of
redundancy. All of the Distribute blocks are identical. All
of the IPRPMs are identical, as are the OPRPMs. It may be
possible to use fewer IPRPMs and OPRPMs by reusing these
blocks for multiple input or output ports. This would
increase the run time of processes C and C+.

6.7,1 The Desired Rates 
At the start of the rate update process, the VLs 
calculate a desired rate. One factor affecting a VL's 
desired rate is the amount of traffic queued at the input 
port for that VL. If a VL has a lot of traffic waiting to be 
served, its desired rate will be high. If a VL has a small 
amount of traffic waiting, its desired rate will be lower. 
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The relationship between a VL's queue depth and its desired 
rate may be linear or it may be a more complicated non-linear 
function. Another quantity that may factor into the desired 
rate is a minimum rate. Certain connections may be so 
5 important that they require some non-zero rate even if they 
have no traffic currently queued. If the update interval is 
long, then prediction may play a part in the desired rate as 
well. For example^ if the update interval is only 256 cell 
slots, then prediction may not be necessary and the few 
OlO minimum rates needed will be small. 

As an example, consider a VL going through input node i
and output node j. Assume that the VL has a queue depth of q
and a minimum rate of m. Letting M be the interval length,
one possible desired rate formula would be

r^{des}(i,j) = \min\{M, \max\{m, q\}\}.    (45)

The min{} term in equation (45) prevents the VL from
asking for more cell sending opportunities than there are in
an interval.
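As a rough illustration of the desired rate formula, the following Python sketch evaluates it for a few queue depths. It assumes the queue-depth term is the raw depth q measured in cells; the function and parameter names are hypothetical and do not come from the specification.

```python
def desired_rate(queue_depth, min_rate, interval_len):
    """Desired rate for one VL, in cell slots per update interval.

    Assumption: the queue-depth term of the formula is the depth q itself.
    """
    # max{} keeps the minimum rate even when the queue is empty;
    # min{} caps the request at the number of slots in one interval.
    return min(interval_len, max(min_rate, queue_depth))


if __name__ == "__main__":
    M = 256  # update interval length in cell slots (example value)
    for q in (0, 40, 900):
        print(q, desired_rate(queue_depth=q, min_rate=8, interval_len=M))
```

An idle VL keeps its minimum rate, a moderately backlogged VL asks for its backlog, and a heavily backlogged VL is capped at M.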
6.7.2 Sending Desired Rates to a Central Location
After all of the VLs at an input port have determined
their desired rates, the desired rates must be sent to the
central location (the CRPU) where the rate determinations
take place. The data may be of any format. It should be
capable of handling every VL requesting M slots. At the
CRPU, each IPRPM needs information from only one input port.
So, the payload of each cell received by the CRPU only needs
to be routed to a single IPRPM.

6.7.3 IPRPM Calculations - the Requested Rates 
Process C+ considers each input port independently even
though the desired rates from all of the VLs are all at one
location. In contrast, process A considers all of the
information from all of the ports simultaneously. The
independence of process C+ reduces the number of operations
that process C+ needs to perform and allows this portion of
process C+ to be operated in a parallel manner.

At each IPRPM, the data from one input port is used.
The desired rates of the VLs passing through the port are
examined and the bandwidth of the input port is divided among
the VLs. Each VL will be awarded a requested rate, r^{req}(i,j).
The sum of these rates is equal to, or less than, the
capacity of the input BTM link, C. These rates represent how
the input port would divide up its bandwidth if the output
ports were not a concern. When examining a specific input
node, the set of desired rates plays a role. If the sum of
the desired rates is equal to, or less than, C, then all of
the requested rates equal the desired rates. If the sum of
the desired rates is greater than C, then the bandwidth of
the input node is divided up in a weighted fair manner (using
a fixed set of weights {\phi_{ij}: i,j = 1, ..., N}) among the VLs,
with no VL receiving more than its desired rate.

The following pseudo-code shows how this phase of the
process works. The subroutine Distribute(BW=R, Limit=L,
Weights=\phi) returns a single parameter, x^*_{in=i}, known as the
requested rate factor. The requested rate for each VL can be
determined from this factor as

r^{req}(i,j) = \min\{\phi_{ij} x^*_{in=i}, r^{des}(i,j)\}.

The Distribute subroutine distributes an amount of
bandwidth equal to R in a weighted fair manner, using the
weight vector \phi, allowing no VL to accept additional
bandwidth exceeding its entry in a limit vector L. The inner
workings of this subroutine are discussed below. For input
port i,

IF \sum_j r^{des}(i,j) \ge C
THEN x^*_{in=i} = Distribute(BW=C, Limit=r^{des}, Weights=\phi)
ELSE x^*_{in=i} = very large number

If \sum_j r^{des}(i,j) < C, then the sum of all the desired
bandwidth is less than the bandwidth available. So, every VL
should get a requested rate equal to its desired rate. Since
r^{req}(i,j) = \min\{\phi_{ij} x^*_{in=i}, r^{des}(i,j)\}, setting
x^*_{in=i} high means r^{req} = r^{des} (this corresponds to the
ELSE clause). If there is not enough bandwidth to let
r^{req} = r^{des} for every VL, then the THEN clause calls the
Distribute subroutine to divide the available bandwidth, C,
among the VLs. The Distribute subroutine does this by finding
the optimal x* value.

The single parameter x* is used because it is quicker
and easier to pass than numerous requested rate values. In
Fig. 21, the IF-THEN statement above is performed by
Thresholding block 106. If needed, this block passes any
necessary information into Distribution block 107, which
performs the Distribute subroutine.
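The input-port phase can be sketched in Python as follows. This is only an illustration under stated assumptions: the `distribute` helper below finds the rate factor exactly by walking the sorted freeze points, which is one possible realization and not the hardware method described later; all function names are hypothetical, and weights are assumed to be strictly positive.

```python
def distribute(bw, limits, weights):
    """Return x such that sum(min(w * x, limit)) equals bw.

    Assumes positive weights and 0 <= bw <= sum(limits).
    """
    # Visit VLs in the order in which they freeze (limit / weight).
    order = sorted(range(len(limits)), key=lambda i: limits[i] / weights[i])
    slope = sum(weights)   # current growth rate of the total
    given = 0.0            # bandwidth handed out so far
    x = 0.0
    for i in order:
        cut = limits[i] / weights[i]
        step = slope * (cut - x)
        if given + step >= bw:
            return x + (bw - given) / slope
        given += step
        x = cut
        slope -= weights[i]
    return x  # every VL reached its limit


def requested_rates(desired, weights, capacity):
    """Requested rates for the VLs of one input port."""
    if sum(desired) >= capacity:
        x = distribute(capacity, desired, weights)   # THEN clause
    else:
        x = float("inf")                             # ELSE clause: "very large number"
    return [min(w * x, d) for w, d in zip(weights, desired)]


if __name__ == "__main__":
    print(requested_rates([100, 50, 10], [1, 1, 1], 120))  # overloaded port
    print(requested_rates([10, 20], [1, 1], 120))          # lightly loaded port
```

In the overloaded case the two smaller requests are met in full and the largest VL absorbs the remainder, so the requested rates sum to the link capacity.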

6.7.4 OPRPM Calculations - The Granted Rates 
At this point, the desired rates and requested rate
factors are sent to the OPRPMs. At an OPRPM, the requested
rates are determined for the VLs terminating at the port from
the requested rate factors. Now, armed with estimates of
what rates all the VLs could use (the r^{des} rates), and what
each input node could give each VL (the r^{req} rates), it is
time to calculate the rates the output ports will grant to
the VLs. These rates are the granted rates, r^{gnt}. The output
port phase of the process uses the four sets of thresholds
shown below for output port j.

T0(i) = 0

Ti(i) = min{r^^^(i, j),r^^^(i, j) } 
T2(i)= r^"^(i, j) 
T3(i)= C 


When passing out the bandwidth of an output port, all of
the VLs must reach any threshold before any VL moves beyond
the threshold. For instance, if the capacity of a port lies
between the sum of the first threshold and the sum of the
second threshold (\sum_i T_1(i) < C < \sum_i T_2(i)), then all of
the VLs will start at T1 and increase in proportion to their
weights from there. If a VL reaches T2, it is frozen and does
not move beyond T2. In pseudo-code, the granted rate factor,
x^*_{out=j}, can be determined as follows.

IF \sum_i T_1(i) \ge C
THEN k=0 {Case 0 below}
ELSE IF \sum_i T_2(i) \ge C
THEN k=1 {Case 1 below}
ELSE k=2 {Case 2 below}

Now,

x^*_{out=j} = Distribute(BW = C - \sum_i T_k(i), Limit = T_{k+1} - T_k, Weights = \phi)    (47)

and

r^{gnt}(i,j) = T_k(i) + \min\{\phi_{ij} x^*_{out=j}, T_{k+1}(i) - T_k(i)\}.    (48)

The object of the foregoing pseudocode is to have the
granted rates for all of the VLs fall between the same pair
of thresholds. That is, all VLs passing through the same
output port should get a granted rate that is between their
T0 and T1, T1 and T2, or T2 and T3 thresholds. There should
not be a VL with a granted rate between its T0 and T1
thresholds and another with a granted rate between its T1 and
T2 thresholds. The IF-THEN-ELSE statements in the pseudocode
determine if all of the granted rates will be between T0 and
T1 (case 0), T1 and T2 (case 1), or T2 and T3 (case 2). To
fall under case k, there must be enough bandwidth to give
each VL at least a granted rate equal to its T_k value, but
not enough to give each VL a granted rate equal to its T_{k+1}
value. Thus, every VL starts with a bandwidth equal to its
T_k value and the remaining bandwidth of (C - \sum_i T_k(i)) is
divided among the VLs by the Distribute subroutine in a fair
way. Each VL may receive at most additional bandwidth of
T_{k+1} - T_k, since no VL may exceed its T_{k+1} threshold. The
granted rates are calculated using equation (48).

The IF-THEN-ELSE code is performed by the
Thresholding/Calculation blocks in Fig. 21. These blocks
pass the necessary information to the Distribute blocks,
which find the granted rate factors. The
Thresholding/Calculation blocks 109 then determine the
granted rates.
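The output-port phase can be sketched in Python as below. The thresholds T1 and T2 are taken as inputs rather than derived, T0 is zero and T3 is the port capacity as stated above, and the `distribute` helper is an assumed exact solver included only to keep the sketch self-contained. All names are hypothetical.

```python
def distribute(bw, limits, weights):
    """Return x such that sum(min(w * x, limit)) equals bw (positive weights)."""
    order = sorted(range(len(limits)), key=lambda i: limits[i] / weights[i])
    slope, given, x = sum(weights), 0.0, 0.0
    for i in order:
        cut = limits[i] / weights[i]
        step = slope * (cut - x)
        if given + step >= bw:
            return x + (bw - given) / slope
        given += step
        x = cut
        slope -= weights[i]
    return x


def granted_rates(t1, t2, weights, capacity):
    """Granted rates for the VLs terminating at one output port."""
    n = len(t1)
    t = [[0.0] * n, list(t1), list(t2), [capacity] * n]   # T0 .. T3
    # Pick the case k so that every VL lands between T_k and T_{k+1}.
    if sum(t1) >= capacity:
        k = 0
    elif sum(t2) >= capacity:
        k = 1
    else:
        k = 2
    limits = [t[k + 1][i] - t[k][i] for i in range(n)]
    x = distribute(capacity - sum(t[k]), limits, weights)       # eq. (47)
    return [t[k][i] + min(weights[i] * x, limits[i])            # eq. (48)
            for i in range(n)]


if __name__ == "__main__":
    print(granted_rates([30, 20, 10], [60, 50, 10], [1, 1, 1], 100))  # case 1
    print(granted_rates([80, 40], [90, 60], [1, 1], 60))              # case 0
    print(granted_rates([10, 10], [20, 20], [1, 3], 100))             # case 2
```

In each case the granted rates sum to the port capacity, and every VL sits between the same pair of thresholds.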
6.7.5 Sending Granted Rates to Input Ports
After all of the granted rates have been determined, the
granted rates must be sent back to the appropriate input
port, where they can be used by the local schedulers. The
granted rates for each input port are gathered together
before they can be sent to the input port, since each input
port has a granted rate at every OPRPM. The data may be of
any format.

6.7.6 The Distribute Subroutine

The main computational burden of process C+ is in the
Distribute subroutine (a.k.a. the Averaged Generalized
Processor Sharing, or AGPS, process). Since the Distribute
subroutine runs many times per update interval (e.g., 2N
times, where N is the number of ports), this subroutine forms
the heart of process C+. If this subroutine can be performed
quickly, then process C+ can be performed quickly.

There are some differences between the way the
Distribute subroutine has been defined here and the way it is
defined in later sections, even though the basic purpose of
the subroutine remains unchanged. Most importantly, the
Distribute subroutine described here returns a rate factor
(x*) that can be used to calculate rates. Versions of the
subroutine that are described later determine x*, but then
determine and return all of the rates.

A hardware implementation may be used to implement the
Distribute subroutine. The Distribute Subroutine Module
(DSM) architecture described here performs the Distribute
subroutine operations in parallel. Before discussing this
architecture, however, it is necessary to describe what the
Distribute subroutine does.

Fundamentally, the equation underlying the Distribute
subroutine is constructed of N functions of the form

r_i(x) = \min\{\phi_i x, \mu_i\},    (49)

where each function corresponds to a VL and i denotes the VL
bound for output port i if this calculation is being done in
an IPRPM or the VL coming from input port i if this
calculation is being done in an OPRPM. \mu_i equals the bound on
the amount of additional bandwidth the VL can accept before
it hits its limiting threshold (\mu_i = Limit(i) = T_{k+1}(i) - T_k(i))
while \phi_i is the weight of the VL. The basic shape of one of
these functions is shown in Fig. 22.
Summing the N functions results in a new function that
determines how the port's bandwidth will be divided,

R(x) = \sum_{i=1}^{N} r_i(x).    (50)

The Distribute subroutine finds the value of x for which R(x)
equals the amount of bandwidth that needs to be passed out,
D. This function is shown in Fig. 23. As x is increased, a
VL accepts bandwidth at a rate proportional to its weight
(the \phi_i x term in eq. (49)) until it reaches its limit (the \mu_i
term in eq. (49)), after which it does not accept any
additional bandwidth, no matter how large x becomes. Thus,
the slope of R(x) decreases with x because more VLs are
frozen for larger x values. When R(x)=D, all of the free
bandwidth has been passed out. A closed form solution to
this problem does not exist.

The most convenient form to write the problem in is

g(x) = D - R(x).    (51)

In this form, the fundamental task becomes finding the root
of the function of eq. (51). Let x* be the value of x such
that g(x*)=0. Once x* has been found, the amount of extra
bandwidth VL i receives is r_i(x*). Note that g(x) is not
differentiable at the points where x = \mu_i/\phi_i. To deal with
this, the slope of r_i at x = \mu_i/\phi_i will be taken as zero.
Otherwise this finite set of points will be ignored.



6.7.7 Reformulating R(x)
The first step towards finding x* in a parallel manner
is to reformulate R(x). To this end, examine
r_i(x) = \min\{\phi_i x, \mu_i\}. Note that the cut-off, where the
minimum shifts from \phi_i x to \mu_i, occurs at x = \mu_i/\phi_i (see
Fig. 1 also). Thus, r_i(x) can be written as

r_i(x) = s_i x + m_i,    (52)

where

s_i = \phi_i if x < \mu_i/\phi_i, and s_i = 0 otherwise,    (53)

and

m_i = 0 if x < \mu_i/\phi_i, and m_i = \mu_i otherwise.    (54)
Note that s_i and m_i are functions of x even though this
dependence will often be suppressed. Now, rewriting R(x)
gives

R(x) = \sum_{i=1}^{N} r_i(x)
     = \sum_{i=1}^{N} (s_i x + m_i)
     = x \sum_{i=1}^{N} s_i + \sum_{i=1}^{N} m_i.

By writing R(x) this way, only a single multiplication,
instead of N multiplications, is needed.

6.7.8 Newton-Raphson Method
One method for finding the roots of an equation is
Newton's method (i.e., the Newton-Raphson method). This
method involves a sequence of guesses of x which approach
x*. The update formula is

x_{k+1} = x_k - f(x_k) / f'(x_k),

where f(x) is the equation to solve.
To find the root of g(x) (eq. (51)), start by rewriting
g(x) using equation (56),

g(x) = D - R(x) = D - x \sum_{i=1}^{N} s_i - \sum_{i=1}^{N} m_i.    (57)

Now the derivative of g(x) with respect to x is

g'(x) = -\sum_{i=1}^{N} s_i.    (58)

Thus,

x_{k+1} = x_k - g(x_k) / g'(x_k)
        = x_k + (D - x_k \sum_{i=1}^{N} s_i(x_k) - \sum_{i=1}^{N} m_i(x_k)) / \sum_{i=1}^{N} s_i(x_k)
        = (D - \sum_{i=1}^{N} m_i(x_k)) / \sum_{i=1}^{N} s_i(x_k).    (59)

So given x_k, s_i and m_i are determined for VLs 1 through N by
determining whether x_k is greater than or less than \mu_i/\phi_i.
After summing the s_i and m_i values, x_{k+1} can be determined
using equation (59). This process can be performed in
parallel by determining all of the s_i and m_i values
simultaneously.
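The iteration of equation (59) can be sketched sequentially in Python as follows. The starting point D divided by the sum of the weights is an assumption made for this sketch (it never overshoots the root because R(x) cannot exceed x times the sum of the weights), and the early exit when every VL is frozen is an added guard against division by zero. The function name and defaults are hypothetical.

```python
def distribute_newton(d, limits, weights, iterations=5):
    """Approximate x* with g(x*) = 0, where g(x) = D - sum(min(phi*x, mu)).

    Each pass classifies every VL as growing (contributes to the slope sum)
    or frozen (contributes its limit to the offset sum), then applies
    x_{k+1} = (D - sum m_i) / sum s_i.
    """
    x = d / sum(weights)          # assumed initial guess
    for _ in range(iterations):
        s_sum = 0.0
        m_sum = 0.0
        for mu, phi in zip(limits, weights):
            if x < mu / phi:      # below the cut-off: still growing
                s_sum += phi
            else:                 # at or above the cut-off: frozen, slope zero
                m_sum += mu
        if s_sum == 0.0:          # every VL frozen; nothing more to hand out
            break
        x = (d - m_sum) / s_sum
    return x


if __name__ == "__main__":
    limits = [1, 10, 10]
    weights = [1, 1, 2]
    x_star = distribute_newton(7, limits, weights)
    rates = [min(w * x_star, mu) for w, mu in zip(weights, limits)]
    print(x_star, rates)
```

In hardware the per-VL comparisons would run in parallel; the inner loop here only stands in for that step.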

6.7.9 Block Diagram of Distribute Subroutine 
A block diagram of a Distribute Subroutine Module (DSM)
is shown in Fig. 24. This module can be used to find x*.
This figure shows one implementation of the Distribute blocks
in Fig. 21. Note that all of the \phi_i and \mu_i values are input
into the appropriate comparison blocks. The DSM could
calculate the \mu_i/\phi_i values or they could be read in. Note
that the values 1/\phi_i can be predetermined and updated as the
weights change, e.g., when connections are added or deleted.

Some method of stopping the iterations could be used to
detect when x_k reaches, or comes close to, x*. For now, it
will be assumed that a fixed number of iterations, say five, 
will be run in all cases. While this many iterations may be 
more than necessary in some cases and not enough in others, 
this simplification allows the maximum run time to be 
consistent in all cases. The initial guess should be 




The value can be predetermined to speed the operation 

of the process. 

The main computational elements of this architecture are
the two sums and the divide in the update eq. (59). If the
adders are constructed using a cascade of two-input adders,
then N-1 adders are needed for each. A total of 2N-2
two-input adders would be needed. The complexity of the
division will depend on the number of bits used for the data.

6.8 Rate Assignment: Process D

Process C runs the Distribute subroutine, or DSM, to 
pass out bandwidth to connections passing through each node. 
Process D details several possible implementations of the 
DSM. In all cases, it is assumed the DSM knows the following 
information: 

• X, the set of connections 

• BW, the set of current bandwidths for the connections 

• T, the set of thresholds for the connections 

• \phi, the set of weights for the connections

• C, the capacity of the node. 

Some or all of this information may be passed into the DSM 
with each call and some of it may be given to the DSM on a 
semi-permanent basis when a connection is established or 
removed. 

The foregoing information is somewhat different from the
information that is being passed to the DSM in process C.
For instance, in process C, C - \sum BW(x) was passed. In process
D, C and BW(x) values are passed separately and C - \sum BW(x) is
determined in the DSM.

Connection x has bandwidth BW(x) (\le T(x)) before the
Distribute subroutine is run. The DSM will return a new set
of bandwidths such that BW(x) \le T(x) for all x and
\sum_{all x} BW(x) = C. The new rates are returned in the BW(x)
variables. Note that the bandwidths assigned to a connection
are not decreased by the DSM. The rates will only increase or
stay the same.

6.8.1 First Implementation Of The DSM 
The first process finds bandwidth in a manner similar to 
process A, i.e., it slowly grows the rates until they hit a 
limit or the total supply of bandwidth is exhausted. On 
every pass through the process, an expansion coefficient is 
determined for each connection. The lowest of these 
coefficients is found and all of the connections' bandwidths 
are expanded by this amount. The connections that have 
reached their limit are removed from the set of connections 
that are still accepting additional bandwidth. The process 
repeats until all of the bandwidth has been passed out. This 
process operates as follows: 
B^{alive} = set of connections that are still accepting
additional bandwidth

Step 1:
FOR all x, IF BW(x) < T(x) THEN put x in B^{alive}
{If BW(x) < T(x), then connection x can be allowed to
grow. If BW(x) = T(x), connection x should not be given any
more bandwidth.}

Step 2:
FOR x \in B^{alive}, b_x = (T(x) - BW(x)) / \phi_x
{For each non-frozen connection, calculate how much
the connection's bandwidth could expand before reaching
its limit.}

Step 3:
b^c = (C - \sum_{all x} BW(x)) / \sum_{x \in B^{alive}} \phi_x
{Calculate how much all connections could expand
before running out of bandwidth if no connection would
reach its limit first.}

Step 4:
b^{min} = \min\{b_x, for all x \in B^{alive}\}
{Find the smallest amount of bandwidth expansion
possible before some connection reaches its limit.}

Step 5:
IF b^c < b^{min} THEN GOTO Step 10
{If b^c < b^{min}, the node runs out of bandwidth to
distribute before another connection would reach its
limit.}

Step 6:
FOR x \in B^{alive}, BW(x) = BW(x) + \phi_x b^{min}
{The bandwidth of each non-frozen connection is
increased by \phi_x b^{min}.}

Step 7:
FOR x \in B^{alive}, IF BW(x) = T(x) THEN remove x from B^{alive}
{Connections that reach their limit are frozen by
removing them from the set of connections eligible to
expand.}

Step 8:
FOR x \in B^{alive}, b_x = b_x - b^{min}
{Calculate how much non-frozen connections can now
expand.}

Step 9:
GOTO Step 3

Step 10:
FOR x \in B^{alive}, BW(x) = BW(x) + \phi_x b^c
{The bandwidth of each non-frozen connection is
increased by \phi_x b^c. Since there is no more bandwidth to
distribute, the process ends.}

Step 2 may require up to N divisions, where N (N>1) is the
number of VLs (more if connections, and not VLs, are being
scheduled). This step only has to be performed once, however.
In addition, step 2 involves dividing by the \phi values. Since
these weights do not change each time the DSM is called, it
is possible to store a table of 1/\phi values. This would
reduce the worst case of step 2 to N multiplies. Step 3
requires a single division but it has to be performed on each
iteration. Since there can be up to N iterations in the
worst case, this step could add up. A potentially large
number of multiplications occur in steps 6 and 10 and a min{}
operation is performed in step 4.
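The first implementation can be sketched in Python as shown below. This is a software illustration only: the freeze test uses a small tolerance `eps` instead of an exact equality check, frozen connections are snapped to their thresholds, and the function name is hypothetical. Positive weights are assumed.

```python
def dsm_first(bw, thresholds, weights, capacity, eps=1e-12):
    """Grow all non-frozen connections together until the capacity is used."""
    bw = list(bw)
    # Step 1: connections below their threshold may still grow.
    alive = [x for x in range(len(bw)) if bw[x] < thresholds[x]]
    # Step 2: expansion each connection can absorb before its limit.
    b = {x: (thresholds[x] - bw[x]) / weights[x] for x in alive}
    while alive:
        # Step 3: expansion possible before the node runs out of bandwidth.
        b_c = max(0.0, (capacity - sum(bw)) / sum(weights[x] for x in alive))
        # Step 4: smallest expansion before some connection freezes.
        b_min = min(b[x] for x in alive)
        if b_c < b_min:
            # Steps 5 and 10: bandwidth runs out first; hand out the rest.
            for x in alive:
                bw[x] += weights[x] * b_c
            break
        for x in alive:
            bw[x] += weights[x] * b_min   # Step 6
            b[x] -= b_min                 # Step 8
        for x in alive:
            if b[x] <= eps:
                bw[x] = thresholds[x]     # snap to the limit (float safety)
        alive = [x for x in alive if b[x] > eps]   # Step 7
    return bw


if __name__ == "__main__":
    print(dsm_first([10, 10, 10], [20, 50, 100], [1, 1, 2], 100))
```

In the example, the first connection freezes at its threshold after one pass and the remaining bandwidth is split 1:2 between the other two.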

6.8.2 Second Implementation Of The DSM

Bandwidth can be distributed using a calendar in a
manner similar to process B. In fact, a single, slight
modification to the single node process B is all that is
needed. Instead of starting each Slots(x) value at 0, it
should be started at the appropriate value for the bandwidth
already awarded to connection x. (Total should not be
started at 0 either.) This process uses no divides, no
multiplies, and no additions. It does use some IF-THENs and
some increments.

The amount of processing power needed for this process
may vary greatly from one run to the next depending on the
input variables. It may take a number of passes through the
calendar to pass out all of the bandwidth. So, the worst
case number of operations may be higher than the average
number of operations. In addition to the work done in
performing the process itself, the calendar needs to be
constructed. This can occur in parallel to rate assignment.

It is possible to cap the number of times the process
goes through the calendar. After this limit has been
reached, the remaining slots can be assigned without regard
to the connections' limiting values. This would give extra
bandwidth to connections with large weights that have already
reached their limits at the expense of connections with small
weights that have not yet reached their limits.

6.8.3 Third Implementation Of The DSM 
This process uses relatively few operations. After a
quick check to find out what connections are already close to
their threshold values, the bandwidth is divided up in a
weighted fair fashion.

Step 1:
BW^{extra} = C - \sum_{all x} BW(x)
{Calculate how much extra bandwidth there is to
divide up among connections.}

Step 2:
FOR all x, IF T(x) > BW(x) + \gamma THEN put x in B^{alive}
{Only give more bandwidth to a connection if its
current bandwidth is more than \gamma below its limit.}

Step 3:
b = BW^{extra} / \sum_{x \in B^{alive}} \phi_x
{Calculate how much expansion is possible.}

Step 4:
FOR x \in B^{alive}, BW(x) = BW(x) + \phi_x b
{Give new bandwidth to eligible connections.}

This process requires a single division. One drawback
of this process is that it is possible for connections to end
up with bandwidths that exceed their thresholds. In
particular, connections with large weights and thresholds
slightly greater than \gamma plus the current bandwidth will
receive bandwidth far above their threshold values, where \gamma
is an adjustable parameter, possibly zero. The next process
addresses this problem by examining each connection, limiting
each connection to T(x), and then making additional
iterations to pass out any bandwidth this frees up.
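A minimal Python sketch of this single-division process follows. The function name, the `gamma` keyword, and the handling of an empty eligible set are assumptions added for illustration.

```python
def dsm_third(bw, thresholds, weights, capacity, gamma=0.0):
    """One-shot weighted split of the spare capacity; may overshoot limits."""
    # Step 1: spare bandwidth at the node.
    extra = capacity - sum(bw)
    # Step 2: only connections more than gamma below their limit are eligible.
    alive = [x for x in range(len(bw)) if thresholds[x] > bw[x] + gamma]
    out = list(bw)
    if not alive:
        return out          # nobody eligible; assumed behavior for this sketch
    # Step 3: the single division.
    b = extra / sum(weights[x] for x in alive)
    # Step 4: hand out bandwidth in proportion to the weights.
    for x in alive:
        out[x] += weights[x] * b
    return out


if __name__ == "__main__":
    thresholds = [20, 50, 100]
    out = dsm_third([10, 10, 10], thresholds, [1, 1, 2], 100)
    print(out)
    print([o > t for o, t in zip(out, thresholds)])  # flags overshoot
```

The example shows the drawback described above: the first connection ends up above its threshold because no limit check is applied after the split.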
6.8.4 Fourth Implementation Of The DSM
In this process, all of the remaining bandwidth is given
to connections that are still accepting bandwidth without
regard to the connections' bandwidth limits. Each of the
connections is then examined. Connections that have been
awarded bandwidth in excess of their limits have their
bandwidth reduced to their limits. The extra bandwidth is
returned to the unassigned bandwidth bucket to be passed out
in the next iteration. These connections are also removed
from the set of connections accepting more bandwidth. In
short, the extra bandwidth is distributed to the connections
and then any extra is returned. This is in contrast to the
processes described above in sections 6.8.1 and 6.8.2. In
those processes, bandwidth is distributed fairly. No
connection receives bandwidth in excess of its limits.

Step 1:
BW^{extra} = C - \sum_{all x} BW(x)
{Calculate how much extra bandwidth there is to
divide up among connections.}

Step 2:
FOR all x, IF BW(x) < T(x) THEN put x in B^{alive}
{Only connections that have not reached their
bandwidth limit are eligible to receive additional
bandwidth.}

Step 3:
b = BW^{extra} / \sum_{x \in B^{alive}} \phi_x
{Calculate expansion factor.}

Step 4:
FOR x \in B^{alive}, BW(x) = BW(x) + \phi_x b
{Increase bandwidth of each non-frozen connection by
\phi_x b.}

Step 5:
FOR x \in B^{alive}, IF BW(x) > T(x) THEN BW(x) = T(x), remove x
from B^{alive}
{If a connection's bandwidth is now above its limit,
set the connection's bandwidth to its limit and remove the
connection from the set of connections eligible to receive
more bandwidth.}

Step 6:
BW^{extra} = C - \sum_{all x} BW(x)
{Calculate how much non-allocated bandwidth is still
left to distribute.}

Step 7:
IF BW^{extra} \ne 0 THEN GOTO Step 3
{If there is no unallocated bandwidth remaining, end
the procedure. If not, perform another loop.}

There is only one division per iteration with this process
(step 3), since all of the connections use the same b value.
Since this process may have at most N iterations, where N
(N>1) is the number of VLs (more if connections are being
considered), there can be at most N divisions. Step 4
involves a number of multiplications which could be costly.
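The give-then-reclaim loop of the fourth implementation can be sketched in Python as below. The tolerance `eps` on the "no bandwidth left" test and the stop when no eligible connection remains are assumptions added so the sketch terminates under floating-point arithmetic; the function name is hypothetical.

```python
def dsm_fourth(bw, thresholds, weights, capacity, eps=1e-9):
    """Hand out all spare bandwidth, clip overshoot, and repeat."""
    bw = list(bw)
    extra = capacity - sum(bw)                                      # Step 1
    alive = [x for x in range(len(bw)) if bw[x] < thresholds[x]]    # Step 2
    while extra > eps and alive:                                    # Step 7
        b = extra / sum(weights[x] for x in alive)                  # Step 3
        for x in alive:
            bw[x] += weights[x] * b                                 # Step 4
        for x in list(alive):                                       # Step 5
            if bw[x] > thresholds[x]:
                bw[x] = thresholds[x]   # clip to the limit
                alive.remove(x)         # frozen from now on
        extra = capacity - sum(bw)                                  # Step 6
    return bw


if __name__ == "__main__":
    print(dsm_fourth([10, 10, 10], [20, 50, 100], [1, 1, 2], 100))
```

With these inputs the loop needs two passes: the first overshoots and clips one connection, and the second hands the reclaimed bandwidth to the other two.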

One possible alteration to this process would be to cap
the number of iterations that the process can perform.
Although the worst case is N iterations, the process may be
very close to the final values after many fewer steps. If
the process were stopped after, e.g., 10 iterations, some of
the connections would be above their threshold values
(assuming the process is stopped between steps 4 and 5).
While this is unfair to the other connections, it is not going
to cause any fatal errors.

As a variation on this process, the B^{alive} set could be
eliminated. Bandwidth would be offered to all of the
connections on each iteration. While this would eliminate
the trouble of dealing with the B^{alive} set, it would cause the
convergence of the process to slow. This is especially true
if several connections with large weights have reached their
limits. They would keep being given large amounts of
additional bandwidth, which they would keep giving back.
This variation would resemble modification 2 in process B
(section 6.5.4). In both processes, bandwidth is distributed
to all of the connections, but then taken away from
connections that exceed their limits.



6.9 Rate Assignment: Process E

Described here are two processes that perform the DSM of
process C. They serve the same purpose as the process D
subroutines. These two processes have been designed to
reduce the number of memory accesses during rate assignment.
This is done in order to reduce process run-time, thereby
making a short update interval possible.

Each of these processes starts with the expansion
coefficients, namely the b_i values. If these coefficients
are not sent by the input ports, then their values must be
determined. This additional step will increase the time
needed to run the process unless these values can be
determined as the data arrives from the input ports.

6.9.1 Process I
The first process sorts the set of expansion
coefficients and then works its way through this sorted list.
The pseudocode for this process, set forth below, operates as
follows. Although described here in terms of VLs, this
process would also work for any switch connections or sets of
connections.

VL j adds bandwidth at a rate of \phi_j as the expansion
factor grows until the VL reaches its limit. Once the VL has
reached its limit, its bandwidth does not increase as b
grows. Thus, if all of the VLs are below their limit, the
total amount of bandwidth that has been distributed grows as
\sum_j \phi_j as b increases. If VL k is the first VL to reach its
limit, the total bandwidth distributed grows as
\sum_j \phi_j - \phi_k after VL k reaches its limit until the next
VL, say VL l, reaches its limit. At this point, the bandwidth
grows as \sum_j \phi_j - \phi_k - \phi_l, and so on. If VL k can grow
by b_k \phi_k before it reaches its limit, then b can only
increase by b_k before VL k stops accepting more bandwidth.
The VL with the smallest b_j will be the first VL to stop
accepting more bandwidth as b grows. The VL with the second
smallest will be the second, and so on. The following
pseudocode increases b (called b^{tot}) until either all of the
extra bandwidth has been distributed or a VL reaches its
limit (b^{tot} reaches the b_j value of some VL). In the former
case the procedure terminates. In the latter case the code
calculates how quickly the total bandwidth distributed will
be increasing until the next VL reaches its limit.



1 Sort 128 b_j values.
(Rename them b_1', ..., b_128' with b_1' \le ... \le b_128', and
call their associated weights \phi_1', ..., \phi_128'.)
{Sort expansion coefficients from smallest to largest.}

2 s = \sum_j \phi_j {This is how fast the total bandwidth that has
been distributed grows as b^{tot} increases until the first
VL reaches its limit.}

3 b^{tot}=0 {b^{tot} starts at zero.}

4 BW=0 {No additional bandwidth has been passed out yet.}

5 i=0 {Counter starts at zero.}

6 b_0'=0, \phi_0'=0 {These are needed for steps 10 and 14 to
work when i=0.}

7 WHILE BW<C {As long as the additional bandwidth passed
out hasn't exceeded the total amount available to pass
out, perform loop below that passes more bandwidth out.}

8 BEGIN (WHILE LOOP)

9 i=i+1

10 \Delta b = b_i' - b_{i-1}' {Amount bandwidth may expand before
next VL reaches its limit.}

11 BW' = BW + s \Delta b {Total bandwidth distributed increases by
s \Delta b as long as BW' does not exceed C.}

12 IF BW'>C {Check to see if this would pass out too much
bandwidth.}

13 THEN \Delta b = (C-BW)/s, BW=C {BW' is too large. Expand by
this \Delta b instead to pass out exactly C.}

14 ELSE BW=BW', s = s - \phi_i' {BW' is OK so BW=BW'. Reduce s by
the \phi value of the VL that reached its limit.}

15 b^{tot} = b^{tot} + \Delta b {Add additional expansion to running
total of expansion so far.}

16 END (WHILE LOOP)
The sort in line 1 is a costly operation in terms of
memory accesses. The number of memory accesses for this
operation should be between N log_2 N and log_2 N!. It should be
noted that this sort could be performed as the data arrives
at the processor if data is arriving slowly enough. For
instance, if data from each input port is read from a
separate data cell, then each item can be placed in the
sorted list as the next cell is being read. This reduces the
time needed to run this process.
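Process I can be sketched in Python as follows. The sketch works for any number of VLs rather than a fixed 128, treats C as the amount of additional bandwidth to pass out, and returns only the final expansion factor; the function name is hypothetical and positive weights are assumed.

```python
def process_one(b, weights, capacity):
    """Return the total expansion b_tot after walking the sorted b values."""
    pairs = sorted(zip(b, weights))      # line 1: sort expansion coefficients
    s = sum(weights)                     # line 2: initial growth rate
    b_tot = 0.0                          # line 3
    bw = 0.0                             # line 4
    prev = 0.0                           # line 6: b_0' = 0
    for b_i, phi_i in pairs:             # lines 7-9
        delta = b_i - prev               # line 10
        bw_new = bw + s * delta          # line 11
        if bw_new > capacity:            # line 12
            # line 13: expand only far enough to pass out exactly C.
            return b_tot + (capacity - bw) / s
        bw = bw_new                      # line 14: accept the step ...
        s -= phi_i                       # ... and freeze this VL
        b_tot += delta                   # line 15
        prev = b_i
    return b_tot                         # every VL reached its limit


if __name__ == "__main__":
    b = [10, 40, 45]
    weights = [1, 1, 2]
    b_tot = process_one(b, weights, 70)
    extra = [w * min(b_tot, b_j) for w, b_j in zip(weights, b)]
    print(b_tot, extra)
```

Each VL's additional bandwidth is its weight times the smaller of the final expansion factor and its own coefficient, which in the example sums to the 70 units available.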

6.9.2 Process II
As in Process I, this process would also work for any
switch connections or sets of connections instead of VLs. As
before, the bandwidth given to VL i grows at a rate of \phi_i
with respect to b as b is increased until VL i reaches its
bandwidth limit at b_i. After VL i reaches its limit, VL i
receives no additional bandwidth as b increases above b_i.
Thus, the total additional bandwidth passed out grows as
\sum_j \phi_j with respect to b, where the sum only includes VLs
for which b < b_j. Process II divides the b axis into a series
of bins, each of width \Delta b. VL i falls into bin k if
(k-1)\Delta b < b_i \le k\Delta b. (VL i is in bin 0 if b_i = 0.) In
effect, b_i will be approximated by k\Delta b. The value b does
not increase continuously in this process but rather
increases in steps of \Delta b. As b steps from (k-1)\Delta b to
k\Delta b, the total additional bandwidth passed out to all of the
VLs is \Delta b \sum_j \phi_j, where the sum is over VLs in bin k or
greater. In the pseudocode below, c_k = \sum_j \phi_j, where the sum
is over the VLs in bin k. The amount of additional bandwidth
distributed by increasing b from (k-1)\Delta b to k\Delta b is
(\sum_j \phi_j - \sum_i c_i)\Delta b, where the first sum is over all
VLs and the second sum is over bins 0 through k-1.

1 Set c_i=0 in all bins

2 FOR j=1 TO 128, add \phi_j to the c value of the bin dictated
by b_j {Lines 1 and 2 find each c_i.}

3 s = \sum_j \phi_j {Initial slope.}

4 i=0 {Bin counter starts at zero.}

5 BW=0 {Bandwidth distributed starts at zero.}

6 s = s - c_i {Slope is reduced by sum of slopes of VLs in bin
i.}

7 BW = BW + s \Delta b {Bandwidth passed out so far increases by
s \Delta b.}

8 IF BW<C

9 THEN i=i+1, GOTO 6 {Bandwidth has not exceeded limit
so advance to the next bin.}

10 ELSE b = i \Delta b, END {Bandwidth too much, so quit.}
The FOR loop in line 2 requires a read and a write (read
the current bin contents, c_i, and write the new contents,
c_i + \phi_j) as well as a search for the correct bin and an
addition. The size of the bins (\Delta b) determines how long it
will take to perform the process and how accurate the
approximation is.
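The binned process can be sketched in Python as follows. The bin index is computed with a ceiling, which matches the rule that VL i falls in bin k when (k-1)Δb < b_i ≤ kΔb; the number of bins, the fallback when every VL freezes first, and the function name are assumptions made for this sketch.

```python
import math


def process_two(b, weights, capacity, delta_b):
    """Return the expansion factor b on a grid of width delta_b."""
    n_bins = max(math.ceil(b_j / delta_b) for b_j in b) + 1
    c = [0.0] * n_bins                           # line 1
    for b_j, phi_j in zip(b, weights):           # line 2: fill the bins
        c[math.ceil(b_j / delta_b)] += phi_j
    s = sum(weights)                             # line 3: initial slope
    bw = 0.0                                     # line 5
    for i in range(n_bins):                      # lines 4 and 9
        s -= c[i]                                # line 6: VLs in bin i freeze
        bw += s * delta_b                        # line 7
        if bw >= capacity:                       # lines 8 and 10
            return i * delta_b
    return (n_bins - 1) * delta_b                # all VLs hit their limits


if __name__ == "__main__":
    b = [10, 40, 45]
    weights = [1, 1, 2]
    for width in (5, 1):
        print(width, process_two(b, weights, 70, width))
```

Because the pseudocode stops at the last grid point before the capacity is reached, the result falls slightly short of the exact expansion factor, and a narrower bin width tightens the approximation at the cost of more bins to scan.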

6.10 Rate Assignment: Process F 

Process F is similar in structure to process C. Process
F has a lower computational burden, however. It approximates
the "Distribute" subroutine of process C using a single
equation.

Like process C, process F includes an input node phase
and an output node phase. Each of these phases utilizes a
series of thresholds. There are two main differences between
processes C and F. The manner in which the bandwidth is
distributed to flows has been modified in process F to reduce
the computational cost of the process. This process uses
several approximations to the "Distribute" subroutine that
give acceptable results. The other change from process C is
the threshold values. The threshold values have been
modified somewhat to enhance performance, reduce the number
of operations required, and reduce the amount of
communications required.

6.10.1 Input Node Phase 
Each virtual link (or VP or VC or switch connection or
set of connections) uses two rates to form a pair of
thresholds. These two thresholds are used to form an
estimate of the rate that a VL will be able to use in the
next interval. This estimate takes into account both the
amount of traffic the VL would like to send and the loading
of the input node that this traffic must pass through.
First, every VL x has a minimum guaranteed rate, r^g(x). A VL
will always receive a rate of at least r^g(x) if it needs to
use this much bandwidth. If the weights at a node are scaled
so that they always sum up to one or less, then this
guaranteed rate is

r^g(x) = \phi(x) C    (61)

where C is the capacity of the node. It is also possible to
set this rate to some other fixed value unrelated to the VL's
weight. The rate only changes when connections are added and
deleted from the VL. The second rate for a VL is
re-determined for each interval. It is an estimate of the
bandwidth the VL would like to receive for the next interval.
For VL x, this desired rate is r^{des}(x). This rate could
depend on the queue depths and recent arrival patterns of the
connections in the VL as well as other possible information.
The thresholds are as shown.

$T_1(x) = \min\{r^{des}(x),\, r^{g}(x)\}$
$T_2(x) = r^{des}(x)$

The input node does not award bandwidth beyond a VL's
desired rate. The input node uses these thresholds to form
bandwidth requests, $r^{req}(x)$, which are sent to the appropriate
output nodes. $\sum_y T_1(y)$ should be less than or equal to C, since
$T_1(y)$ is at most $r^{g}(y)$ and $\sum_y r^{g}(y) \le C$. If $\sum_y T_1(y)$ is
greater than C, then an error in CAC (Connection Admissions
Control) has occurred and too much traffic has been admitted.
If $\sum_y T_1(y) \le C < \sum_y T_2(y)$, then each VL gets at least its
$T_1$ value. In addition, each VL gets additional bandwidth from
the remaining bandwidth pool in proportion to the difference
between its $T_2$ and $T_1$ thresholds. If $\sum_y T_2(y) \le C$, each VL gets
bandwidth equal to its $T_2$ value. Bandwidth is distributed in
the following manner.

IF $\sum_y T_1(y) > C$

THEN CAC Error {Too much traffic admitted.}
ELSE IF $\sum_y T_2(y) > C$

THEN {In this case $\sum_y T_1(y) \le C < \sum_y T_2(y)$.}
$T_{ex}(x) = T_2(x) - T_1(x)$ for all x

$r^{req}(x) = T_1(x) + \frac{T_{ex}(x)}{\sum_y T_{ex}(y)} \left[ C - \sum_y T_1(y) \right]$ (62)

{Each VL gets $T_1$ bandwidth and additional bandwidth
in proportion to $T_2 - T_1$.}

ELSE {In this case $\sum_y T_2(y) \le C$.}

$r^{req}(x) = T_2(x)$ {Each VL receives its desired rate.}

Each $r^{req}(x)$ value is sent to the appropriate output port.
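
The input node phase can be sketched in Python as follows. This is an illustrative sketch only: the function name `input_node_requests`, the use of dictionaries keyed by VL, and the use of `ValueError` to signal the CAC error are assumptions made for the example and do not come from the process description.

```python
def input_node_requests(desired, guaranteed, capacity):
    """Form the bandwidth request for every VL at one input node.

    desired, guaranteed: dicts mapping a VL id to its desired and
    guaranteed rate. Returns a dict mapping each VL id to its request.
    """
    # Thresholds: T1 = min(desired, guaranteed), T2 = desired.
    t1 = {x: min(desired[x], guaranteed[x]) for x in desired}
    t2 = dict(desired)
    sum_t1 = sum(t1.values())
    sum_t2 = sum(t2.values())

    if sum_t1 > capacity:
        # Too much traffic admitted.
        raise ValueError("CAC error: sum of T1 exceeds node capacity")

    if sum_t2 > capacity:
        # sum(T1) <= C < sum(T2): each VL gets T1 plus a share of the
        # leftover pool in proportion to T2 - T1 (equation 62).
        t_ex = {x: t2[x] - t1[x] for x in t1}
        scale = (capacity - sum_t1) / sum(t_ex.values())  # same for all x
        return {x: t1[x] + t_ex[x] * scale for x in t1}

    # sum(T2) <= C: each VL receives its desired rate.
    return t2
```

For example, with desired rates of 60 and 80, guaranteed rates of 40 each, and a capacity of 100, the 20 units left after the guarantees are split 1:2, and the requests sum to the capacity.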

6.10.2 Output Node Phase 
Each output node takes the bandwidth requests of all of 
the VLs passing through it and forms a peak cell rate that is 
sent back to the input nodes. The output node thresholds are 
as shown. 

$T_1(x) = \min\{r^{req}(x),\, r^{g}(x)\}$
$T_2(x) = r^{req}(x)$



The peak rate for each VL can be found using the pseudocode
shown below. This is similar to the input node phase except
when $\sum_y T_2(y) < C$. In this case, each VL receives $T_2$ and the
remaining bandwidth is distributed in proportion to the $\phi_j$s.
This gives VLs bandwidth in excess of $r^{req}$. This serves to
overbook the inputs as discussed earlier.

IF $\sum_y T_1(y) > C$
THEN CAC Error {Too much traffic admitted.}
ELSE IF $\sum_y T_2(y) > C$

THEN {In this case $\sum_y T_1(y) \le C < \sum_y T_2(y)$.}

$T_{ex}(x) = T_2(x) - T_1(x)$ for all x

$r^{PCR}(x) = T_1(x) + \frac{T_{ex}(x)}{\sum_y T_{ex}(y)} \left[ C - \sum_y T_1(y) \right]$ (63)

{Each VL gets $T_1$ bandwidth and additional bandwidth
in proportion to $T_2 - T_1$.}
ELSE {In this case $\sum_y T_2(y) \le C$.}

$T_{ex}(x) = C - T_2(x)$ for all x

$r^{PCR}(x) = T_2(x) + \frac{T_{ex}(x)}{\sum_y T_{ex}(y)} \left[ C - \sum_y T_2(y) \right]$ (64)

{This ensures $\sum_x r^{PCR}(x) = C$ for this case.}



In the foregoing process, all of the output port bandwidth is
assigned. Note that in equations 62 and 63, the quantity
$\frac{1}{\sum_y T_{ex}(y)} \left[ C - \sum_y T_1(y) \right]$ is the same for all x. This quantity only
needs to be scaled by $T_{ex}(x)$ for each VL. In equation 64,
$\frac{1}{\sum_y T_{ex}(y)} \left[ C - \sum_y T_2(y) \right]$ is the same for each VL.
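
The output node phase can be sketched in Python in the same way. Again this is a sketch under assumptions, not a definitive implementation: the function name `output_node_pcr`, the dictionary-based interface, and the handling of a zero denominator are hypothetical. The underloaded branch uses $T_{ex}(x) = C - T_2(x)$ as in the pseudocode, so that all of the output port bandwidth is assigned.

```python
def output_node_pcr(requested, guaranteed, capacity):
    """Form the peak cell rate for every VL at one output node.

    requested, guaranteed: dicts mapping a VL id to its bandwidth
    request and guaranteed rate. Returns a dict of peak rates.
    """
    # Thresholds: T1 = min(request, guaranteed), T2 = request.
    t1 = {x: min(requested[x], guaranteed[x]) for x in requested}
    t2 = dict(requested)
    sum_t1 = sum(t1.values())
    sum_t2 = sum(t2.values())

    if sum_t1 > capacity:
        raise ValueError("CAC error: sum of T1 exceeds node capacity")

    if sum_t2 > capacity:
        # Equation 63: start from T1, share the pool by T2 - T1.
        base = t1
        t_ex = {x: t2[x] - t1[x] for x in t1}
        pool = capacity - sum_t1
    else:
        # Equation 64: start from T2, share the pool by C - T2,
        # which overbooks the inputs beyond their requests.
        base = t2
        t_ex = {x: capacity - t2[x] for x in t2}
        pool = capacity - sum_t2

    total_ex = sum(t_ex.values())
    if total_ex == 0:
        # Assumed edge case: nothing left to hand out.
        return dict(base)
    scale = pool / total_ex  # computed once, then scaled per VL
    return {x: base[x] + t_ex[x] * scale for x in base}
```

Computing `scale` once and multiplying it by each VL's excess mirrors the observation that the bracketed quantity is the same for every VL.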



Process F depends on equations 62, 63, and 64 to produce
good results. These equations can be viewed as members of a
larger class of equations. In fact, it may be advantageous
to replace one or more of them with another member of its
class. In general, if the rates fall between two sets of
thresholds, $T_j$ and $T_{j+1}$, then the bandwidth may be distributed
using the general rule

$T_{ex}(x) = T_{j+1}(x) - T_j(x)$ (65)

$r(x) = T_j(x) + \frac{\phi^m(x)\,T_{ex}^n(x)}{\sum_y \phi^m(y)\,T_{ex}^n(y)} \left[ C - \sum_y T_j(y) \right]$ (66)

Note the exponents m and n in this equation. Also note that
each link will receive a rate of at least $T_j(x)$ and the sum
of the rates will equal C. The properties of the
distribution scheme will vary depending on the values of the
exponents m and n.
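
The general rule of equations 65 and 66 can be sketched in Python as follows. This is a hedged illustration: the function name `distribute_general` and its dictionary-based interface are assumptions, and the sketch assumes strictly positive weights and a nonzero denominator, neither of which is checked.

```python
def distribute_general(t_lo, t_hi, weights, capacity, m, n):
    """Distribute `capacity` among links whose rates fall between the
    threshold sets t_lo (T_j) and t_hi (T_j+1).

    t_lo, t_hi, weights: dicts keyed by link id. m and n are the
    exponents on the weight and on the excess, respectively.
    """
    # Equation 65: excess between the two thresholds.
    t_ex = {x: t_hi[x] - t_lo[x] for x in t_lo}
    # Numerator of equation 66 for each link.
    share = {x: (weights[x] ** m) * (t_ex[x] ** n) for x in t_lo}
    total = sum(share.values())
    pool = capacity - sum(t_lo.values())
    # Each link gets at least T_j; the rates sum to the capacity.
    return {x: t_lo[x] + share[x] / total * pool for x in t_lo}
```

With lower thresholds of 10 each, upper thresholds of 30 and 50, equal weights, and a capacity of 50, setting m=0, n=1 yields rates of 20 and 30, while m=1, n=0 yields 25 and 25. In both cases the rates sum to the capacity.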

The m=1, n=0 rule is set forth in process D (equation
66), section 6.8.3. This rule is fair in the sense that
every link will receive extra bandwidth in proportion to its
weight. The drawback of this rule is that the $T_{j+1}(x)$
thresholds do not come into play. In other words, some r(x)
may exceed $T_{j+1}(x)$ even though other links may be below their
(j+1) limit. This is especially true for links with small $T_{ex}$
values and larger weights.

The m=0, n=1 rule has the desirable property that no
connection will exceed its $T_{j+1}$ threshold. (To see this, note
that $[C - \sum_y T_j(y)] / \sum_y T_{ex}(y) \le 1$.) The distribution of
bandwidth is not fair, however, because bandwidth is awarded
in proportion to the $T_{ex}$ values, as opposed to the weights.
Thus, links with large $T_{ex}$ values or small weights will
receive more bandwidth than they deserve.

One compromise between these two rules would be m=1,
n=1. While this rule would not be perfectly fair and the
rates may exceed the (j+1) threshold, the m=1 advantages for
small $T_{ex}$ values and large weights may be counterbalanced by
the n=1 advantages for large $T_{ex}$ values and small weights.
The optimal values of m and n may depend on the traffic and
may not be integers.

In the rule for m=-1, n=1, since r(x)=b^(x), this rule
corresponds to distributing extra bandwidth based on how much 
each link may expand before it reaches the next threshold. 
Other rules are also possible. One class of rules has the 
form 

While this process may be less fair than process C, it 
is reasonably fair. The number of operations is reduced, 
which should make a shorter update interval possible. In
addition, only a single piece of information is being passed 
from the input port to the output port, so the communications 
bandwidth is reduced. 

7.0 Architecture

Fig. 25 shows a device 120 (e.g., a computer, switch, 
router, etc.) for performing the processes of sections 1 to 6 
above. Device 120 includes a processor 121, a memory 122, 
and a storage medium 124, e.g., a hard disk (see view 125).
Storage medium 124 stores computer-executable instructions 
126 for performing the processes of sections 1 to 6. 
Processor 121 executes computer-executable instructions 
(computer software) 126 out of memory 122 to perform the 
processes of sections 1 to 6. 

The processes of sections 1 to 6, however, are not 
limited to use with the hardware/software configuration of 
Fig. 25; they may find applicability in any computing or 
processing environment. The processes of sections 1 to 6 may 
be implemented in hardware, software, or a combination of the 
two (e.g., using an ASIC (application-specific integrated 
circuit) or programmable logic). The processes of sections 1
to 6 may be implemented in one or more computer programs 
executing on programmable computers that each includes a 
processor, a storage medium readable by the processor 
(including volatile and non-volatile memory and/or storage 
elements), at least one input device, and one or more output 
devices. Program code may be applied to data entered to 
perform the processes of sections 1 to 6 and to generate
output information. 

Each such program may be implemented in a high level 
procedural or object-oriented programming language to 
communicate with a computer system. However, the programs
can be implemented in assembly or machine language. The 
language may be a compiled or an interpreted language. 

Each computer program may be stored on a storage medium
or device (e.g., CD-ROM, hard disk, or magnetic diskette)
that is readable by a general or special purpose programmable
computer for configuring and operating the computer when the
storage medium or device is read by the computer to perform
the processes of sections 1 to 6. The processes of sections
1 to 6 may also be implemented as a computer-readable storage
medium, configured with a computer program, where, upon
execution, instructions in the computer program cause the
computer to operate in accordance with the processes.

Other embodiments not described herein are also within 
the scope of the following claims. For example, any one or 
more facets of the processes of sections 1 to 6 may be
combined, resulting in a new process. 

WHAT IS CLAIMED IS:

1. A method of allocating bandwidth to data traffic 
flows for transfer through a network device, comprising: 

allocating bandwidth to a committed data traffic flow 
based on a guaranteed data transfer rate and a queue size of
the committed data traffic flow in the network device; and 

allocating bandwidth to uncommitted data traffic flows 
using a weighted maximum/minimum process. 

2. The method of claim 1, wherein the weighted
maximum/minimum process allocates bandwidth to the
uncommitted data traffic flows in proportion to weights
associated with the uncommitted data traffic flows.

3. The method of claim 2, wherein the weighted
maximum/minimum process increases bandwidth to the
uncommitted data traffic flows in accordance with the weights
associated with the uncommitted data traffic flows until at
least one of the uncommitted data traffic flows reaches a
maximum bandwidth allocation.

4. The method of claim 3, wherein the weighted
maximum/minimum process allocates remaining bandwidth to 
remaining uncommitted data traffic flows based on weights 
associated with the remaining uncommitted data traffic flows. 

5. The method of claim 1, wherein the bandwidth 
comprises data cell slots.

6. The method of claim 1, wherein the bandwidth is 
allocated to the data traffic flows in discrete time 
intervals.

7. A method of allocating bandwidth to data flows 
passing through a network device, each of the data flows
having an associated weight, comprising: 

increasing an amount of bandwidth to the data flows in 
proportion to the weights of the data flows until one port 
through the network device reaches a maximum value; 

freezing the amounts of bandwidth allocated to the data 
flows in the one port; and 

increasing the amount of bandwidth to remaining data 
flows passing through the network device in proportion to the 
weights of the remaining data flows. 

8. The method of claim 1, further comprising:
increasing the amount of bandwidth to the remaining data
flows until another port through the network device reaches a
maximum value;

freezing the amounts of bandwidth allocated to the data
flows in the other port; and

increasing the amount of bandwidth to remaining data
flows passing through the network device in proportion to the
weights of the remaining data flows.



9. The method of claim 1, further comprising assigning
one or more of the data flows a minimum bandwidth, wherein
the amount of bandwidth allocated to the one or more data
flows is increased relative to the minimum bandwidth.

10. The method of claim 1, wherein the bandwidth is
allocated to the data flows in discrete time intervals.

11. A method of allocating bandwidth to data flows
passing through a network device, comprising:

allocating a predetermined amount of bandwidth to one or
more of the data flows; and

distributing remaining bandwidth to remaining data
flows.

12. The method of claim 11, wherein the remaining
bandwidth is distributed to the remaining data flows using a
weighted maximum/minimum process.

13. The method of claim 12, wherein the weighted
maximum/minimum process comprises:

increasing an amount of bandwidth to the remaining data
flows in proportion to weights associated with the remaining
data flows until one port through the network device reaches
a maximum value.

14. The method of claim 13, wherein the weighted
maximum/minimum process further comprises:

freezing the amounts of bandwidth allocated to the
remaining data flows in the one port; and

increasing the amount of bandwidth to still remaining
data flows passing through the network device in proportion
to weights of the still remaining data flows.

15. A method of allocating bandwidth to data flows
passing through a network device, comprising:

determining a character of the data flows; and

allocating bandwidth to the data flows in accordance
with the character of the data flows;

wherein the bandwidth is allocated to data flows
according to which data flows have a highest probability of
using the bandwidth.

16. The method of claim 15, wherein the character of 
the data flows includes peak cell rate, likelihood of bursts, 
and/or average cell rate. 

17. A method of allocating bandwidth to data flows 
passing through a network device, comprising: 

allocating the bandwidth using a weighted 
maximum/minimum process. 

18. The method of claim 17, wherein the weighted 
maximum/minimum process comprises: 

assigning weights to the data flows; and 
allocating the bandwidth to the data flows according to 
the weights. 

19. The method of claim 18, wherein allocating the
bandwidth according to the weights comprises:

increasing an amount of bandwidth allocated to each data
flow in proportion to a weight assigned to the data flow; and

freezing the amount of bandwidth allocated to a data
flow when either (i) an input port or an output port of the
network device reaches a maximum utilization, or (ii) the
data flow reaches a maximum bandwidth.

20. The method of claim 19, further comprising:
increasing an amount of bandwidth to remaining data
flows passing through the network device until either (i)
another input port or output port of the network device
reaches a maximum utilization, or (ii) one of the remaining
data flows reaches a maximum bandwidth;

freezing an amount of bandwidth allocated to the
remaining data flow that has reached a maximum bandwidth or
to the remaining data flow passing through an input or output
port that has reached a maximum utilization; and

increasing the amount of bandwidth to still remaining
data flows passing through the network device in proportion
to weights associated with the remaining data flows.

21. The method of claim 20, wherein, after all of the
data flows passing through the network device are frozen, the
method further comprises:

distributing remaining bandwidth at an output port to
data flows passing through the output port.

22. The method of claim 20, wherein, after all of the
data flows passing through the network device are frozen, the
method further comprises:

distributing remaining bandwidth at an output port to
data flows passing through the output port in proportion to
weights of the data flows passing through the output port.

23. The method of claim 20, wherein, after all of the
data flows passing through the network device are frozen, the
method further comprises:

distributing remaining bandwidth at an output port to
data flows passing through the output port according to which
data flows have a highest probability of using the bandwidth.

24. The method of claim 17, wherein the bandwidth is 
allocated in discrete time intervals. 

25. A method of allocating bandwidth to data flows
through a network device, comprising:

allocating bandwidth to the data flows using a weighted
max/min process;

wherein an amount of bandwidth allocated to data flows
passing through an input port of the network device is
greater than an amount of data that can pass through the
input port of the network device.

26. A method of allocating bandwidth to data flows
passing through a network device, comprising:

allocating bandwidth to data flows passing through input
ports of the network device using a weighted max/min process.

27. The method of claim 26, wherein allocating the
bandwidth comprises:

increasing bandwidth allocated to data flows passing
through each input port in proportion to a weight assigned to
the data flow passing through the input port; and

freezing an amount of bandwidth allocated to a data flow
passing through an input port when either (i) the input port
reaches a maximum utilization, or (ii) the data flow reaches
a maximum bandwidth.

28. The method of claim 27, further comprising:
continuing to increase the bandwidth allocated to
non-frozen data flows in proportion to weights of the data
flows until an amount of bandwidth is frozen at all of the
data flows.

29. A method of allocating bandwidth to data flows
through a network device, comprising:

allocating bandwidth to the data flows passing through
output ports of the network device using a weighted max/min
process.

30. The method of claim 29, wherein allocating the
bandwidth comprises:

increasing an amount of bandwidth allocated to data
flows passing through each output port in proportion to a
weight assigned to a data flow passing through an output
port; and

freezing the amount of bandwidth allocated to the data
flow passing through the output port when either (i) the
output port reaches a maximum utilization, or (ii) the data
flow reaches a maximum bandwidth.

31. The method of claim 30, further comprising:
continuing to increase the amount of bandwidth allocated
to non-frozen data flows in proportion to weights of the data
flows until the amount of bandwidth allocated to all data
flows is frozen.

32. The method of claim 31, wherein maximum values
assigned to each data flow are based on the bandwidth
allocations.

33. The method of claim 30, wherein, after the amount 
of bandwidth assigned to all output ports is frozen, the 
method further comprises: 

distributing remaining bandwidth at an output port to 
data flows passing through the output port. 

34. The method of claim 30, wherein, after the amount of 
bandwidth assigned to all output ports is frozen, the method 
further comprises: 

distributing remaining bandwidth at an output port to
data flows passing through the output port in proportion to 
weights of the data flows. 

35. The method of claim 30, wherein after all of the
data flows passing through the network device are frozen, the
method further comprises:

distributing remaining bandwidth at an output port to
data flows passing through the output port according to which
data flows have a highest probability of using the bandwidth.

36. The method of claim 26, wherein the bandwidth is 
allocated in discrete time intervals. 



37. The method of claim 26, further comprising:

allocating bandwidth to committed data traffic based on
a guaranteed data transfer rate.

38. The method of claim 37, wherein bandwidth is
allocated to the committed data traffic in response to a
request for bandwidth such that any request that is less than
or equal to the guaranteed data transfer rate is granted.

39. The method of claim 26, wherein:

the bandwidth is allocated to uncommitted data traffic
and, for committed data traffic, bandwidth is allocated based
on a guaranteed transfer rate; and

remaining bandwidth, not allocated to the committed data
traffic, is allocated to the uncommitted data traffic.

40. The method of claim 19, further comprising:
allocating a predetermined amount of bandwidth to one or
more of the data flows; and

distributing remaining bandwidth to non-frozen remaining
data flows by:

increasing an amount of bandwidth allocated to each
remaining data flow in proportion to a weight assigned
to the remaining data flow; and

freezing the amount of bandwidth allocated to a
remaining data flow when either (i) an input port or an
output port of the network device reaches a maximum
utilization, or (ii) the remaining data flow reaches a
maximum bandwidth.

41. The method of claim 37, wherein bandwidth is 
allocated to the committed data traffic in response to a 
request for bandwidth such that any request that is greater
than the guaranteed data transfer rate is granted at the
guaranteed rate.

42. An apparatus for allocating bandwidth to data
traffic flows through the apparatus, the apparatus comprising
circuitry which:

allocates bandwidth to a committed data traffic flow
based on a guaranteed data transfer rate and a queue size of
the committed data traffic flow in the apparatus; and

allocates bandwidth to uncommitted data traffic flows
using a weighted maximum/minimum process.



43. The apparatus of claim 41, wherein the weighted
maximum/minimum process allocates bandwidth to the
uncommitted data traffic flows in proportion to weights
associated with the uncommitted data traffic flows.

44. The apparatus of claim 43, wherein the weighted
maximum/minimum process increases bandwidth to the
uncommitted data traffic flows in accordance with the weights
associated with the uncommitted data traffic flows until at
least one of the uncommitted data traffic flows reaches a
maximum bandwidth allocation.

45. The apparatus of claim 44, wherein the weighted 
maximum/minimum process allocates remaining bandwidth to 
remaining uncommitted data traffic flows based on weights 
associated with the remaining uncommitted data traffic flows. 

46. The apparatus of claim 42, wherein the bandwidth
comprises data cell slots.

47. The apparatus of claim 42, wherein the bandwidth is
allocated to the data traffic flows in discrete time
intervals.

48. An apparatus for allocating bandwidth to data flows 
passing through the apparatus, each of the data flows having 
an associated weight, the apparatus comprising circuitry 
which: 

increases an amount of bandwidth to the data flows in 
proportion to the weights of the data flows until one port 
through the apparatus reaches a maximum value; 

freezes the amounts of bandwidth allocated to the data
flows in the one port; and 

increases the amount of bandwidth to remaining data 
flows passing through the apparatus in proportion to the 
weights of the remaining data flows. 

49. The apparatus of claim 48, wherein the circuitry:
increases the amount of bandwidth to the remaining data
flows until another port through the apparatus reaches a
maximum value;

freezes the amounts of bandwidth allocated to the data 
flows in the other port; and 

increases the amount of bandwidth to remaining data 
flows passing through the apparatus in proportion to the 
weights of the remaining data flows. 

50. The apparatus of claim 48, wherein the circuitry 
assigns one or more of the data flows a minimum bandwidth, 
wherein the amount of bandwidth allocated to the one or more 
data flows is increased relative to the minimum bandwidth. 

51. The apparatus of claim 48, wherein the bandwidth is 
allocated to the data flows in discrete time intervals. 

52. An apparatus for allocating bandwidth to data flows
passing through the apparatus, the apparatus comprising 
circuitry which: 

allocates a predetermined amount of bandwidth to one or 
more of the data flows; and 

distributes remaining bandwidth to remaining data flows. 

53. The apparatus of claim 52, wherein the remaining 
bandwidth is distributed to the remaining data flows using a 
weighted maximum/minimum process. 

54. The apparatus of claim 52, wherein the weighted 
maximum/minimum process comprises: 

increasing an amount of bandwidth to the remaining data 
flows in proportion to weights associated with the remaining 
data flows until one port through the apparatus reaches a 
maximum value. 

55. The apparatus of claim 53, wherein the weighted 
maximum/minimum process further comprises: 

freezing the amounts of bandwidth allocated to the 
remaining data flows in the one port; and 

increasing the amount of bandwidth to still remaining
data flows passing through the apparatus in proportion to 
weights of the still remaining data flows. 

56. An apparatus for allocating bandwidth to data flows
passing through the apparatus, the apparatus comprising
circuitry which:

determines a character of the data flows; and

allocates bandwidth to the data flows in accordance with
the character of the data flows;

wherein the bandwidth is allocated to data flows
according to which data flows have a highest probability of
using the bandwidth.

57. The apparatus of claim 56, wherein the character of 
the data flows includes peak cell rate, likelihood of bursts, 
and/or average cell rate. 

58. An apparatus for allocating bandwidth to data flows
passing through the apparatus, the apparatus comprising
circuitry which:

allocates the bandwidth using a weighted maximum/minimum
process.

59. The apparatus of claim 58, wherein the weighted
maximum/minimum process comprises: 

assigning weights to the data flows; and 
allocating the bandwidth to the data flows according to 
the weights. 

60. The apparatus of claim 59, wherein allocating the 
bandwidth according to the weights comprises: 

increasing an amount of bandwidth allocated to each data 
flow in proportion to a weight assigned to the data flow; and 

freezing the amount of bandwidth allocated to a data 
flow when either (i) an input port or an output port of the 
apparatus reaches a maximum utilization, or (ii) the data 
flow reaches a maximum bandwidth. 

61. The apparatus of claim 60, wherein the circuitry:
increases an amount of bandwidth to remaining data flows
passing through the apparatus until either (i) another input
port or output port of the apparatus reaches a maximum
utilization, or (ii) one of the remaining data flows reaches
a maximum bandwidth;

freezes an amount of bandwidth allocated to the
remaining data flow that has reached a maximum bandwidth or
to the remaining data flow passing through an input or output
port that has reached a maximum utilization; and

increases the amount of bandwidth to still remaining
data flows passing through the apparatus in proportion to
weights associated with the remaining data flows.

62. The apparatus of claim 61, wherein, after all of 
the data flows passing through the apparatus are frozen, the 
circuitry distributes remaining bandwidth at an output port 
to data flows passing through the output port. 

63. The apparatus of claim 61, wherein, after all of 
the data flows passing through the apparatus are frozen, the 
circuitry distributes remaining bandwidth at an output port 
to data flows passing through the output port in proportion 
to weights of the data flows passing through the output port. 

64. The apparatus of claim 61, wherein, after all of
the data flows passing through the apparatus are frozen, the
circuitry distributes remaining bandwidth at an output port
to data flows passing through the output port according to
which data flows have a highest probability of using the
bandwidth.

65. The apparatus of claim 58, wherein the bandwidth is 
allocated in discrete time intervals. 

66. An apparatus for allocating bandwidth to data flows 
through the apparatus, the apparatus comprising circuitry 
which: 

allocates bandwidth to the data flows using a weighted 
max/min process; 

wherein an amount of bandwidth allocated to data flows 
passing through an input port of the apparatus is greater 
than an amount of data that can pass through the input port 
of the apparatus. 

67. An apparatus for allocating bandwidth to data flows 
passing through the apparatus, the apparatus comprising 
circuitry which: 

allocates bandwidth to data flows passing through input 
ports of the apparatus using a weighted max/min process. 

68. The apparatus of claim 67, wherein allocating the
bandwidth comprises: 

increasing bandwidth allocated to data flows passing 
through each input port in proportion to a weight assigned to 
the data flow passing through the input port; and 

freezing an amount of bandwidth allocated to a data flow 
passing through an input port when either (i) the input port 
reaches a maximum utilization, or (ii) the data flow reaches 
a maximum bandwidth. 

69. The apparatus of claim 68, wherein the circuitry: 
continues to increase the bandwidth allocated to non- 
frozen data flows in proportion to weights of the data flows 
until an amount of bandwidth is frozen at all of the data 
flows.

70. An apparatus for allocating bandwidth to data flows 
through the apparatus, the apparatus comprising circuitry 
which: 

allocates bandwidth to the data flows passing through 
output ports of the apparatus using a weighted max/min 
process.

71. The apparatus of claim 70, wherein allocating the
bandwidth comprises: 

increasing an amount of bandwidth allocated to data 
flows passing through each output port in proportion to a 
weight assigned to a data flow passing through an output 
port; and 

freezing the amount of bandwidth allocated to the data 
flow passing through the output port when either (i) the 
output port reaches a maximum utilization, or (ii) the data 
flow reaches a maximum bandwidth. 

72. The apparatus of claim 71, wherein the circuitry:
continues to increase the amount of bandwidth allocated
to non-frozen data flows in proportion to weights of the data
flows until the amount of bandwidth allocated to all data
flows is frozen.

73. The apparatus of claim 72, wherein maximum values
assigned to each data flow are based on the bandwidth
allocations.

74. The apparatus of claim 71, wherein, after the
amount of bandwidth assigned to all output ports is frozen,
the apparatus distributes remaining bandwidth at an output
port to data flows passing through the output port.

75. The apparatus of claim 71, wherein, after the
amount of bandwidth assigned to all output ports is frozen,
the apparatus distributes remaining bandwidth at an output
port to data flows passing through the output port in
proportion to weights of the data flows.

76. The apparatus of claim 71, wherein after all of the
data flows passing through the apparatus are frozen, the
apparatus distributes remaining bandwidth at an output port
to data flows passing through the output port according to
which data flows have a highest probability of using the
bandwidth.

77. The apparatus of claim 26, wherein the bandwidth is 
allocated in discrete time intervals. 

78. The apparatus of claim 70, wherein the circuitry:

allocates bandwidth to committed data traffic based on a 
guaranteed data transfer rate.

79. The apparatus of claim 78, wherein bandwidth is
allocated to the committed data traffic in response to a 
request for bandwidth such that any request that is less than 
or equal to the guaranteed data transfer rate is granted. 

80. The apparatus of claim 70, wherein: 

the bandwidth is allocated to uncommitted data traffic 
and, for committed data traffic, bandwidth is allocated based 
on a guaranteed transfer rate; and 
remaining bandwidth, not allocated to the committed data
traffic, is allocated to the uncommitted data traffic.

81. The apparatus of claim 60, wherein the circuitry: 
allocates a predetermined amount of bandwidth to one or
more of the data flows; and

distributes remaining bandwidth to non-frozen remaining 
data flows by: 

increasing an amount of bandwidth allocated to each 
remaining data flow in proportion to a weight assigned 
to the remaining data flow; and

freezing the amount of bandwidth allocated to a 
remaining data flow when either (i) an input port or an 
output port of the apparatus reaches a maximum
utilization, or (ii) the remaining data flow reaches a
maximum bandwidth. 

82. The apparatus of claim 78, wherein bandwidth is 
allocated to the committed data traffic in response to a
request for bandwidth such that any request that is greater
than the guaranteed data transfer rate is granted at the 
guaranteed rate. 
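
Read together, claims 79 and 82 describe a grant that equals the request when the request does not exceed the guaranteed rate, and equals the guaranteed rate otherwise. A minimal sketch follows; the function name grant_committed is an assumption made for illustration.

```c
/* Sketch: bandwidth granted to committed traffic.
 * A request at or below the guaranteed rate is granted in full;
 * a larger request is granted at the guaranteed rate. */
double grant_committed(double requested, double guaranteed)
{
    return (requested <= guaranteed) ? requested : guaranteed;
}
```

For instance, with a guaranteed rate of 5, a request of 3 is granted 3 and a request of 7 is granted 5.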

83. A method of transferring data traffic flows through a
network device, comprising:

transferring a committed data traffic flow through the 
network device using a guaranteed bandwidth; 

determining an amount of bandwidth that was used during 
a previous data traffic flow transfer; and

allocating bandwidth in the network device to 
uncommitted data traffic flows based on the amount of 
bandwidth that was used during the previous data traffic flow 
transfer.

84. The method of claim 83, wherein allocating
comprises:



determining a difference between the amount of bandwidth 
that was used during the previous data traffic flow transfer 
and an amount of available bandwidth; and 

allocating the difference in bandwidth to the 
uncommitted data traffic flows.
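
The difference computation of claim 84 might be sketched as follows. The function name uncommitted_pool is hypothetical, and clamping a negative difference to zero is an assumption of this sketch rather than something stated in the claim.

```c
/* Sketch: bandwidth made available to uncommitted flows, computed as the
 * difference between the available bandwidth and the bandwidth used
 * during the previous transfer. Negative results are clamped to zero
 * (an assumption, not stated in the claim). */
double uncommitted_pool(double available, double used_previous)
{
    double diff = available - used_previous;
    return (diff > 0.0) ? diff : 0.0;
}
```

With 10 units available and 6 units used during the previous transfer, 4 units would be allocated to the uncommitted flows.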



85. An apparatus for transferring data traffic flows
through the apparatus, the apparatus comprising circuitry 
which: 

transfers a committed data traffic flow through the
apparatus using a guaranteed bandwidth;

determines an amount of bandwidth that was used during a 
previous data traffic flow transfer; and 

allocates bandwidth in the apparatus to uncommitted data 
traffic flows based on the amount of bandwidth that was used
during the previous data traffic flow transfer. 



86. The apparatus of claim 85, wherein allocating 
comprises:

determining a difference between the amount of bandwidth
that was used during the previous data traffic flow transfer
and an amount of available bandwidth; and

allocating the difference in bandwidth to the
uncommitted data traffic flows. 

86. The apparatus of any of claims 42, 48, 52, 56, 58, 
66, 67, 70 and 85, wherein the circuitry comprises: 

a memory which stores a computer program; and 
a processor which executes the computer program. 

87. The apparatus of any of claims 42, 48, 52, 56, 58, 
66, 67, 70 and 85, wherein the circuitry comprises discrete 
hardware elements and/or programmable logic.

88. A computer program stored on a computer-readable 
medium for allocating bandwidth to data traffic flows for 
transfer through a network device, the computer program 
comprising instructions that cause a computer to: 

allocate bandwidth to a committed data traffic flow 
based on a guaranteed data transfer rate and a queue size of 
the committed data traffic flow in the network device; and 

allocate bandwidth to uncommitted data traffic flows 
using a weighted maximum/minimum process.

89. A computer program stored on a computer-readable
medium for allocating bandwidth to data flows passing through 
a network device, each of the data flows having an associated 
weight, the computer program comprising instructions that 
cause a computer to: 

increase an amount of bandwidth to the data flows in 
proportion to the weights of the data flows until one port 
through the network device reaches a maximum value;

freeze the amounts of bandwidth allocated to the data
flows in the one port; and 

increase the amount of bandwidth to remaining data flows 
passing through the network device in proportion to the 
weights of the remaining data flows. 

90. A computer program stored on a computer-readable 
medium for allocating bandwidth to data flows passing through 
a network device, the computer program comprising 
instructions that cause the computer to: 

allocate a predetermined amount of bandwidth to one or 
more of the data flows; and 

distribute remaining bandwidth to remaining data flows.

91. A computer program stored on a computer-readable
medium for allocating bandwidth to data flows passing through 
a network device, the computer program comprising 
instructions that cause the computer to: 
determine a character of the data flows; and

allocate bandwidth to the data flows in accordance with 
the character of the data flows; 

wherein the bandwidth is allocated to data flows 
according to which data flows have a highest probability of 
using the bandwidth.



92. A computer program stored on a computer-readable
medium for allocating bandwidth to data flows passing through
a network device, the computer program comprising
instructions that cause the computer to:

allocate the bandwidth using a weighted maximum/minimum
process.



93. A computer program stored on a computer-readable 
medium for allocating bandwidth to data flows through a
network device, the computer program comprising instructions
that cause the computer to: 




allocate bandwidth to the data flows using a weighted 
max/min process; 

wherein an amount of bandwidth allocated to data flows 
passing through an input port of the network device is 
greater than an amount of data that can pass through the
input port of the network device. 

94. A computer program stored on a computer-readable 
medium for allocating bandwidth to data flows passing through 
a network device, the computer program comprising
instructions that cause the computer to: 

allocate bandwidth to data flows passing through input 
ports of the network device using a weighted max/min process.

95. A computer program stored on a computer-readable
medium for allocating bandwidth to data flows through a
network device, the computer program comprising instructions 
that cause the computer to: 

allocate bandwidth to the data flows passing through
output ports of the network device using a weighted max/min
process.

96. A computer program stored on a computer-readable
medium for transferring data traffic flows through a network 
device, the computer program comprising instructions that 
cause a computer to: 
transfer a committed data traffic flow through the
network device using a guaranteed bandwidth;

determine an amount of bandwidth that was used during a 
previous data traffic flow transfer; and 

allocate bandwidth in the network device to uncommitted 
data traffic flows based on the amount of bandwidth that was
used during the previous data traffic flow transfer.

ALLOCATING NETWORK BANDWIDTH
ABSTRACT 

In allocating bandwidth to data for transfer through a 
network device, bandwidth is allocated to committed data 
traffic based on a guaranteed data transfer rate and a queue
size of the network device, and bandwidth is allocated to 
uncommitted data traffic using a weighted maximum/minimum 
process. The weighted maximum/minimum process allocates 
bandwidth to the uncommitted data traffic in proportion to
weights associated with the uncommitted data traffic.

1001 of Title 18 of the United States Code and that such willful false statements may jeopardize the validity of the
application or any patents issued thereon.

APPENDIX



/* Row indices into C[][] and x[][]. */
input = 0;
output = 1;

/* Capacity left at each input and output port after subtracting
   the demanded rates. */
for (i = 0; i < num_inputs; i++) {
    C[input][i] = input_capacity[i];
    for (j = 0; j < num_outputs; j++)
        C[input][i] -= demanded_rates[i][j];
}
for (j = 0; j < num_outputs; j++) {
    C[output][j] = output_capacity[j];
    for (i = 0; i < num_inputs; i++)
        C[output][j] -= demanded_rates[i][j];
}
for (j = 0; j < num_outputs; j++)
    x[output][j] = 0.0;

/* Start each iteration bound D from the desired rates. */
for (i = 0; i < num_inputs; i++)
    for (j = 0; j < num_outputs; j++)
        D[i][j] = desired_rates[i][j];

for (k = 1; k <= num_global_iterations; k++) {
    /* Input pass: distribute each input's capacity over its flows. */
    for (i = 0; i < num_inputs; i++) {
        for (j = 0; j < num_outputs; j++) {
            w[j] = weights[i][j];
            d[j] = D[i][j];
        }
        x[input][i] = dist(num_inputs, w, d, C[input][i], r);
        if (k == num_global_iterations)
            /* Last iteration: record the requested rates. */
            for (j = 0; j < num_outputs; j++)
                requested_rates[i][j] = r[j];
        else {
            /* Freeze input i unless a non-frozen output has a lower x. */
            freeze = 1;
            if (x[input][i] != inf)
                for (j = 0; j < num_outputs; j++)
                    if (x[output][j] != inf
                        && x[input][i] > x[output][j])
                        freeze = 0;
            if (freeze) {
                x[input][i] = inf;
                for (j = 0; j < num_outputs; j++)
                    D[i][j] = r[j];
            }
        }
    }
    /* Output pass: skipped on the last iteration. */
    if (k < num_global_iterations) {
        for (j = 0; j < num_outputs; j++) {
            for (i = 0; i < num_inputs; i++) {
                w[i] = weights[i][j];
                d[i] = D[i][j];
            }
            x[output][j] = dist(num_outputs, w, d,
                                C[output][j], r);
            /* Freeze output j unless a non-frozen input has a lower x. */
            freeze = 1;
            if (x[output][j] != inf)
                for (i = 0; i < num_inputs; i++)
                    if (x[input][i] != inf
                        && x[output][j] > x[input][i])
                        freeze = 0;
            if (freeze) {
                x[output][j] = inf;
                for (i = 0; i < num_inputs; i++)
                    D[i][j] = r[i];
            }
        }
    }
}

/* Final pass: allocate each output's capacity among the requested rates. */
for (j = 0; j < num_outputs; j++) {
    for (i = 0; i < num_inputs; i++) {
        w[i] = weights[i][j];
        d[i] = requested_rates[i][j];
    }
    dist(num_outputs, w, d, C[output][j], r);
    for (i = 0; i < num_inputs; i++)
        allocated_rates[i][j] = r[i];
}
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